FOREWORD


The Personnel Utilization Technical Area of the Army Research Institute
for the Behavioral and Social Sciences (ARI) is concerned with
developing more effective techniques for measuring people's abilities,
to aid in Army job assignment. An emerging technology which offers
considerable promise in this area is computer-based adaptive mental
testing. This report was prepared under Army Project 2Q162717A766,
Manpower Systems Technology, to identify technology gaps and deficiencies
and to summarize new trends in the state of the art of mental
testing.

The report was prepared while the author was a staff member of
ARI. He is presently on the staff of the Naval Personnel Research &
Development Center, San Diego, Calif.
ADAPTIVE MENTAL TESTING: THE STATE OF THE ART


BRIEF 


Requirement:

To  identify  technology  gaps  and  deficiencies  and  to  summarize  new 
trends  in  the  state  of  the  art  of  mental  testing. 


Procedure:

Adaptive mental testing is defined in relation to conventional
mental testing. The state of the art is assessed for each of six research
issues in adaptive mental testing: (1) psychometric theory;
(2) design of adaptive tests; (3) scoring adaptive tests; (4) the testing
medium; (5) item pool development; and (6) advances in measurement
technology.


Findings: 

Specific  research  requirements  are  identified  for  each  research 
issue  in  adaptive  mental  testing.  Discussion  of  these  requirements 
is  also  provided. 


Utilization  of  Findings: 

This research forms a basis for designing a research and development
program for application of adaptive mental testing technology to
military applicant selection and job assignment.


INTRODUCTION 

The  measurement  of  psychological  traits  is  usually  accomplished  by 
observing  the  responses  of  examinees  to  selected  test  items.  For  some 
traits,  notably  the  ability/aptitude  traits  assessed  during  personnel 
selection  and  classification,  all  examinees  are  required  to  answer  a 
common  set  of  items,  and  the  test  score  is  a  linear  composite  of  the 
dichotomous item scores. This test score is used as an index of individual
differences to differentiate among the persons tested.

It has long been known that administering the same test items to
all persons, as is done in conventional group tests, provides less than
optimal discriminability, and that the ability to differentiate accurately
among persons of varying trait status could be enhanced by individually
tailoring the test items to the status of the examinee. In
ability measurement terms, this means dynamically tailoring test item
difficulty to the ability level of the individual. A test that proceeds
in this fashion is called an adaptive, or tailored, test (Weiss & Betz,
1973; Wood, 1973). Adaptive tests have striking psychometric advantages
over conventional tests under certain circumstances, and they have aroused
considerable interest among test theoreticians.

The development of adaptive testing has been motivated largely by
the recognition that conventional group ability tests do not measure
individual differences with equal precision at all levels of ability. This
is because accuracy and precision of measurement are in part a function
of the appropriateness of test item difficulty to the ability of the
individual being measured.

To measure with high precision at all levels of ability requires
tailoring the test (by item difficulty, test length, or both)
to the individual. Since ability is unknown at the outset of testing,
the tailoring process must be done during the test; hence the requirement
for adaptive ability testing. This is done by choosing test items
sequentially, during the test, to adapt the test to the examinee's
ability as shown by responses to earlier test items. This can be done
by a human examiner, by paper-and-pencil tests with special instructions,
or by means of a mechanical testing device. The device most commonly
used is an interactive computer terminal.

The motivation for adaptive testing is that it should permit measuring
ability with higher and more equal precision throughout a wide ability
range than can conventional group tests in which all persons answer the
same test items. In terms of classical psychometric indices, improved
measurement in that sense should be accompanied by corresponding improvements
in reliability and in external validity. In addition to the psychometric
benefits, there are potential psychological benefits to examinees
in the reduction of frustration or boredom resulting from adapting test
difficulty to the individual.


The  rationale  behind  adaptive  testing  has  existed  for  years.  The 
Stanford-Binet intelligence test is an adaptive test, administered personally
by a skilled examiner. Mass testing using adaptive methods
would  make  such  personal  administration  impractical,  however.  The 
development  of  adaptive  testing  awaited  the  availability  of  testing 
media  that  would  permit  widespread  use  of  adaptive  tests  on  a  fairly 
large  scale.  A  number  of  problems — psychometric  and  technological — 
had  to  be  solved  before  adaptive  testing  could  be  practical  on  a  large 
scale.  This  paper  contains  a  review  of  some  of  those  problems,  and  a 
summary  of  the  state  of  the  art  in  research  addressing  them. 


BACKGROUND 

Conventional  Test  Design 

Conventional group-administrable tests of psychological variables,
such as mental abilities, involve administering a common set of items
to all examinees. The total score on such tests, usually the number correct
or some transformation thereof, is used to index individual differences
on the variable being measured. This procedure has been sanctified
by longstanding practice and by empirical usefulness, but it has disadvantages
as a measurement technique.

To  construct  a  conventional  test,  the  test  designer  chooses  some 
subset  of  items  from  a  larger  pool  of  available  items  known  to  measure 
the  variable  of  interest.  Since  the  items  in  the  pool  typically  vary 
in  their  psychometric  properties — particularly  in  their  difficulty — the 
test  designer  must  decide  what  configuration  of  these  item  psychometric 
properties best suits the test's purpose. There are two extreme rationales
to guide that decision. One rationale is to choose items that are
highly  homogeneous  in  item  difficulty.  A  test  so  constructed,  called  a 
"peaked"  test,  will  discriminate  very  effectively  over  a  narrow  range  of 
the  variable,  but  will  discriminate  poorly  outside  that  range.  The 
purpose  of  a  peaked  test  design  is  to  make  fine  discriminations  in  the 
vicinity  of  a  cutting  point;  e.g.,  to  categorize  examinees  into  "go" 
and  "no-go"  groups  for  selection  purposes. 

At  the  opposite  extreme  is  the  "uniform"  test,  constructed  of  items 
that  are  heterogeneous  in  difficulty,  with  item  difficulty  parameters 
spread  over  a  wide  range.  A  uniform  test  will  discriminate  with  more 
or  less  equivalent  precision  over  a  wide  range  of  the  variable,  but 
(other  things  being  equal)  the  level  of  precision  will  be  substantially 
lower  than  that  of  the  peaked  test  at  the  latter’s  best  point.  The 
purpose  of  a  uniform  test  is  to  measure  with  equal  precision  throughout 
a wide range of the trait; e.g., to obtain information to aid
decisions about assignment to jobs requiring varying amounts of the
tested ability.


In constructing a conventional test of given length, the test designer
must choose between high precision over a very narrow range and
low to moderate precision over a wide range. A test cannot have both
high precision and wide range unless the test is very long or the item
difficulty is tailored to each examinee's level on the underlying variable.
The use of long tests is often impractical. The alternative,
tailoring test difficulty to each examinee, represents a striking departure
from conventional group testing practice.


Adaptive  Test  Design 

In  an  adaptive  test,  the  test  administrator  chooses  test  items 
sequentially  during  the  test,  in  such  a  way  as  to  adapt  test  difficulty 
to  examinee  ability  as  shown  during  testing.  An  effectively  designed 
adaptive  test  can  resolve  the  dilemma  inherent  in  conventional  test 
design. By tailoring tests to individuals, the adaptive test can approximately
achieve the high point precision of a peaked test and can
extend  that  high  level  of  precision  over  the  wide  range  of  a  uniform 
test.  As  a  result,  a  well-constructed  adaptive  test  should  be  more 
broadly  applicable  than  a  conventional  test  of  comparable  item  quality 
and  test  length,  since  its  precision  characteristics  make  it  useful 
for  classification  about  one  or  many  cutting  points,  as  well  as  for 
measurement  over  a  wide  range. 

It  is  important  to  understand  how  an  adaptive  test  can  achieve 
psychometric  advantages  over  conventional  tests.  It  can  be  shown  that 
measurement error is a function of the disparity between item difficulty
and personal ability, as well as the discriminating power of the
test  items  and  their  susceptibility  to  guessing.  Since  a  peaked  test 
concentrates  item  difficulty  at  a  single  ability  level,  measurement 
error  should  be  smallest  at  that  critical  level,  and  increasingly  larger 
at  ability  levels  deviant  from  the  critical  point.  In  the  case  of  a 
uniform  test,  item  difficulty  is  spread  over  a  wide  range;  consequently, 
measurement  error  tends  to  be  low  to  moderate  and  fairly  constant  over 
a  correspondingly  wide  range. 

What is desirable, of course, is to achieve small measurement
error over a wide range of the trait scale. This can be done only by
administering items of appropriate difficulty at every ability level
of interest. The rationale of adaptive testing is to do this more
efficiently (i.e., in fewer items) than can be done by conventional
means. This implies individualized choice of test items for each examinee.
Administratively, this can be accomplished (a) by individual
testing by skilled examiners, (b) by specially designed group-administered
paper-and-pencil adaptive tests with rather complex instructions,¹ or
(c) by automated testing using a computer or a specialized stimulus
programmer to choose and administer test items. Research in adaptive
testing has emphasized computer-controlled test administration.

¹ An example of this kind of test is the flexilevel test devised by
Lord (1971a).

Early research pertinent to adaptive testing was reviewed by
Weiss and Betz (1973) and by Wood (1973). Subsequent research has
been reviewed by this writer (McBride, 1976a). Research in adaptive
testing has progressed from exploratory studies of item branching
tests (e.g., Seeley, Morton, & Anderson, 1962), through the explication
of a novel test theory applicable to tailored tests (e.g., Lord,
1970, 1974a), to the verge of operational implementation of a large-scale
adaptive testing system for personnel selection (Urry, 1977b).

From  a  psychometric  viewpoint,  adaptive  tests  are  attractive  for 
a  number  of  reasons.  Adaptive  tests  represent  a  breakthrough  in  the 
technology  of  psychological  measurement,  because  they  can  yield  more 
precise  measurement  over  a  wider  range  with  substantially  fewer  items 
than  can  conventional  tests.  In  other  words,  adaptive  tests  can  achieve 
higher  validity  of  measurement  than  comparable  conventional  tests  in  a 
given test length; or, they can attain a given level of validity in substantially
fewer items than a comparable conventional test (Urry, 1974).

Other aspects of adaptive tests also make them attractive, particularly
if they are computer-administered. Tailoring test difficulty
to examinee ability may reduce error variance caused by examinee frustration,
boredom, or test anxiety (Weiss, 1974), as well as by guessing.
Computer administration and scoring can reduce human error in marking
answers, scoring the tests, and recording the results. Test compromise
can be reduced substantially by eliminating test booklets (thus preventing
theft) and by individualizing test construction (thereby thwarting
the use of cheating devices). Printing, storage, and handling of test
booklets and answer sheets can be eliminated, saving costs.

The psychometric and practical potential of adaptive testing makes
it worthy of research and development in the military manpower setting,
with the goal of eventual implementation of an automated system for
test administration and scoring, and personnel selection, classification,
and job-choice counseling. Some of the relevant research has
already been done and has been reviewed as cited above. One outcome
of the completed research has been the crystallization of a number of
research issues that need to be resolved before deciding whether to
implement an adaptive testing system. The purpose of this report is
to present some of those issues and to evaluate the state of the art
with respect to their resolution.


RESEARCH  ISSUES 


Psychometric  Theory 

Early adaptive testing research showed that traditional test
theory was an inadequate basis for the construction and scoring of
adaptive tests (e.g., Bayroff & Seeley, 1967). This was due to requirements
for item parameters that were invariant with respect to examinee
group, and for means of scoring tests in which different examinees
answered sets of items that differed in difficulty, number, and other
respects as well. One resolution of this issue was provided by the
earlier development of item response theory (Rasch, 1960; Lord, 1952,
1970, 1974a; Birnbaum, 1968), which provided the needed invariance
properties for item parameters and test scoring capabilities.

Subsequent  approaches  to  adaptive  testing  were  developed  that 
did  not  depend  on  the  rather  strong  assumptions  of  item  response  theory. 
Kalisch  (1974)  and  Cliff  (1976)  both  presented  theory  and  methods  for 
adaptive  testing  that  are  not  based  on  the  stochastic  response  models 
of  item  response  theory.  Other  psychometric  bases  appropriate  for  use 
in  adaptive  testing  may  be  forthcoming.  Clearly,  one  research  issue 
to  be  addressed  is  the  adequacy  of  the  psychometric  foundation  of  any 
proposed  approach  to  the  implementation  of  adaptive  testing. 


Item  Response  Models 

Most adaptive testing research since 1968 has used item response
theory (item characteristic curve, or latent trait, theory) as a psychometric
basis. Within item response theory, several competing response
models for dichotomously scored items have been proposed. These
models differ in mathematical form and in the number of parameters
needed to account for item response behavior. Some of these models
include the one-parameter Rasch logistic model (e.g., Wright & Douglas,
1975); the two-parameter normal ogive model (Lord & Novick, 1968); and
the three-parameter logistic ogive model (Birnbaum, 1968). These models
differ in mathematical complexity and in the procedures required to implement
them in practice. If adaptive testing research is to be based
on item response theory, a consequent research issue is to choose from
among the available response models the one best for the purpose. The
basis for such a choice should include consideration of the appropriateness
of the competing models, their robustness under violations of relevant
assumptions, and the difficulty and expense of implementing them.
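
To make the difference in parameter count concrete, the three response models named above can be written as short functions. This is a minimal sketch under stated assumptions, not code from the report: the function names are hypothetical, the symbol `theta` stands for the ability level, and the scaling constant D = 1.7 (the conventional value that brings a logistic curve close to the normal ogive) is assumed rather than taken from the text.

```python
import math

D = 1.7  # conventional logistic-to-normal scaling constant (an assumption here)

def rasch(theta, b):
    """One-parameter logistic model: difficulty b only."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def normal_ogive_2p(theta, a, b):
    """Two-parameter normal ogive: discrimination a and difficulty b.
    The standard normal distribution function is computed from math.erf."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_3p(theta, a, b, c):
    """Three-parameter logistic: adds a lower asymptote c for guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

if __name__ == "__main__":
    # Illustrative item: discrimination 1.2, difficulty 0.5, asymptote 0.2.
    for theta in (-2.0, 0.0, 0.5, 2.0):
        print(theta,
              round(rasch(theta, 0.5), 3),
              round(normal_ogive_2p(theta, 1.2, 0.5), 3),
              round(logistic_3p(theta, 1.2, 0.5, 0.2), 3))
```

All three functions increase monotonically with ability; they differ in how many item parameters must be estimated before the model can be used, which bears on the implementation cost mentioned above.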


Design  of  Adaptive  Tests 


Strategies  for  Adaptive  Testing 


Adaptive testing by definition involves sequential selection of
the test items to be answered by each examinee. Numerous methods for
sequentially choosing items have been proposed. These methods, called
"strategies" for adaptive testing, were reviewed by Weiss (1974). Since
then, several new ones have come forth (e.g., Cliff, 1976; Kalisch,
1974; McBride, 1976b).

These strategies vary along a number of dimensions, including mathematical
elegance, item selection algorithms, scoring methods, and others.
There is a clear need for research to compare the various strategies on
their psychometric and practical merits to provide the data needed to
guide a choice among strategies.


Test  Length 


Any mental test has some criterion for test termination, that is, a rule for
stopping. Usually, a power test terminates when the examinee has answered
all the items (although a time limit may be imposed for administrative
convenience). Some adaptive testing strategies also use fixed test
length as a stopping rule: Terminate testing when the examinee has
answered some fixed number of items. Other strategies for adaptive
testing, however, allow test length to vary from one examinee to another
by basing the termination decision on some criterion other than test
length. For example, testing may be terminated when a ceiling level of
difficulty has been identified (e.g., Weiss's (1973) stratified adaptive
strategy), or when a prespecified degree of measurement precision has
apparently been attained (e.g., Urry, 1974; Samejima, 1977).

The research issue here concerns the relative merits of fixed
length versus variable length adaptive tests. Is one alternative generally
preferable to the other, or preferable for some testing purposes
but not for others? The notion of variable length tests has some intuitive
appeal. Research is required to verify whether variable length
tests have psychometric and practical merit.
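
The two kinds of stopping rule can be contrasted in a few lines. The sketch below is illustrative only and is not drawn from any of the cited strategies: the function names, the per-item information values, and the safety cap `max_items` are hypothetical, and the precision rule assumes that the standard error can be approximated by the reciprocal square root of the information accumulated so far.

```python
import math

def fixed_length_stop(n_administered, test_length):
    """Fixed-length rule: stop once a preset number of items has been answered."""
    return n_administered >= test_length

def precision_stop(accumulated_information, se_target, n_administered, max_items):
    """Variable-length rule: stop when the approximate standard error,
    1 / sqrt(information), reaches the target, or when a cap on length is hit."""
    if n_administered >= max_items:
        return True
    if accumulated_information <= 0.0:
        return False
    return 1.0 / math.sqrt(accumulated_information) <= se_target

def items_needed(item_informations, se_target, max_items):
    """Count the items administered before the precision rule ends the test."""
    total = 0.0
    for n, info in enumerate(item_informations, start=1):
        total += info
        if precision_stop(total, se_target, n, max_items):
            return n
    return len(item_informations)

if __name__ == "__main__":
    # Hypothetical per-item information for two examinees: one whose items
    # are well matched to ability, one whose items are poorly matched.
    well_matched = [0.5] * 30
    poorly_matched = [0.1] * 30
    print("well matched:  ", items_needed(well_matched, 0.4, 30), "items")
    print("poorly matched:", items_needed(poorly_matched, 0.4, 30), "items")
```

Under a precision rule, test length becomes an outcome rather than a design constant, so the cap on length is what bounds testing time for examinees whose items are poorly matched.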


Test  Entry  Level 

Another  aspect  of  the  design  of  adaptive  tests  is  test  entry  level — 
the  difficulty  level  of  the  first  item(s)  the  examinee  must  answer.  In 
some  cases  there  may  be  reliable  information  available  prior  to  testing 
that  would  justify  the  use  of  different  starting  points  for  different 
examinees. For example, in a multitest battery, some subtests are substantially
intercorrelated; an examinee's score on an early subtest may
provide  useful  data  for  choosing  entry  level  on  a  subsequent  subtest. 

The  use  of  differential  entry  levels  may  permit  us  to  improve 
measurement accuracy or to achieve a given level of measurement accuracy
in even fewer items than an adaptive test that uses a fixed entry
level.  Research  is  needed  to  determine  if  these  potential  advantages 
of  differential  test  entry  level  can  be  achieved. 


Scoring Adaptive Tests


Because an adaptive test is fundamentally different from a conventional
test in which everyone answers the same questions, it follows
that conventional test scoring methods may not be applicable to adaptive
tests. That is, it may make little sense to score an adaptive
test by weighting and summing the dichotomous item scores. If so, alternative
scoring methods are needed, which gives rise to yet another
research issue: What means of scoring adaptive tests are available,
and which are "best" in some important sense?

A related issue is the comparability of scores on adaptive tests
with more familiar scores on standardized conventional tests. Are appropriate
score equating methods available for transforming adaptive
test scores into the metric of raw or converted scores of established
conventional measures of the same variables?


The  Testing  Medium 


Conventional ability tests are typically administered via paper
and pencil, and constructed of multiple-choice items. Adaptive tests
using the same item types may be administered individually (a) by a
skilled examiner; (b) at an automated testing terminal, perhaps controlled
by a computer; or (c) by means of specially constructed paper-and-pencil
tests.


Individual testing by skilled examiners is impractical for large-scale
use. Thus, only automated testing terminals and specially designed
paper-and-pencil tests merit serious consideration as potential
media for adaptive testing on a large scale. Whether paper-and-pencil
adaptive testing is even feasible is problematic because of the requirement
for sequential item selection. Another research issue, then, concerns
the feasibility of group administration of paper-and-pencil adaptive
tests.


The feasibility of automated test administration is not in question,
since the presentation of test items and the recording and processing
of an examinee's responses can be done using modern computers with
interactive visual display terminals, such as teletype, cathode ray
tube (CRT), or plasma tube (PLATO) terminals.


Nevertheless,  computers  and  computer  terminals  are  presently 
relatively  expensive  compared  to  traditional  printed  test  booklets  and 
answer  sheets.  It  may  be  preferable  to  base  automated  adaptive  tests 
on  devices  that  are  somewhat  less  sophisticated  and  less  costly  than 
full-scale  computer  systems.  Still  another  research  issue  surfaces 
here:  What  alternative  devices/systems  may  be  used  for  automated 
adaptive  testing,  and  what  are  the  advantages  and  disadvantages  of 
each? 


Item  Pool  Development 


Selecting the items to constitute an adaptive testing item pool is
a somewhat larger undertaking than choosing items for a conventional
test. The psychometric criteria for item selection and for pool construction
are more rigorous than those for conventional test design,
and the item pool must be substantially larger than the length of any
individualized test drawn from it. Since the degree to which an adaptive
test realizes its potential may be limited by the size and quality
of its item pool, it is imperative that research define the necessary
or desirable characteristics of item pools for adaptive testing and
provide practical prescriptions for item pool development.


Advances  in  Measurement  Methodology 

Adaptive  administration  of  traditional  dichotomously  scored  test 
items  promises  a  significant  gain  in  the  psychometric  efficiency  of 
measurement.  Since  adaptive  testing  research  has  stressed  the  use  of 
computer  terminals  for  test  administration,  we  should  exploit  the 
unique  capabilities  of  computers  to  control  test  situations  that  are 
vastly  different  from  the  relatively  simple  tasks  that  comprise  paper- 
and-pencil  tests.  New  approaches  to  ability  measurement  may  arise 
from the conjunction of adaptive test design and computerized test administration,
and thus a number of research issues may arise. These
issues  could  include  the  following:  How  can  the  expanded  stimulus  and 
response  modes  made  possible  by  computer  administration  be  exploited 
to  improve  the  measurement  of  traditional  ability  variables?  What  new 
variables  can  be  identified  and  measured  using  the  computer’s  unique 
capabilities?  Are  scaling  techniques  available  that  are  appropriate 
for  those  new  measures?  How  does  the  utility  of  new  measurement  methods 
compare  with  that  of  traditional  testing? 


THE  STATE  OF  THE  ART 

The problems originally hindering the development and implementation
of adaptive testing were (a) psychometric and (b) practical. The
psychometric problems concerning adaptive tests included the inappropriateness
of classical test theory, the lack of prescriptions for their
design, the need for methods of scoring, and the need for assessing their
measurement properties. The practical problems included the need to
develop new media for administering adaptive tests and the difficulty
of assembling the large pools of test items demanded. Each of these
problems will be discussed below, followed by a brief exposition of
the state of the art relevant to solution of specific problems.


Psychometric Theory


Discussion 


Traditional, or classical, test theory is inadequate to deal with
some of the psychometric problems posed by adaptive tests. The problem
in classical test theory was to order persons with respect to an individual
differences variable on the basis of their number correct or
proportion correct on common or equivalent tests. The observed score
was assumed to differ from the "true score" by a random variable that
was uncorrelated with true score. In adaptive testing, different persons
respond to sets of test items that are in no sense equivalent
across persons. These individualized tests may differ in difficulty,
length, and the discriminating powers of their items. Obviously, the
number or proportion of correct responses is generally an inappropriate
index of individual differences; additionally, measurement error cannot
be assumed to be independent of the variable being measured. A test
theory was needed that could accommodate the special requirements of
adaptive tests.

Several  solutions  to  this  problem  might  be  forthcoming.  A  class 
of  solutions  currently  exists,  in  the  body  of  latent  trait  mental  test 
theories, or item response theory. These "theories" are actually statistical
formulations that account for test item responses in terms of the
respondent's  location  on  a  scale  of  the  attribute  being  measured  by  the 
item.  The  best  developed  formulations  to  date  deal  with  dichotomous 
item  responses  as  functions  of  a  unidimensional  attribute  variable. 

In the language of ability and achievement testing, latent trait
methods treat the probability of a correct response to a test item as
a monotonic increasing function of the relevant underlying ability. When
a scale for the ability is established, the latent trait methods provide
mathematical models relating response probability to scale position.
These models are item trace lines, or item characteristic curves (i.c.c.).

Once a scaling of the attribute has been accomplished and all the
item characteristic functions are known, the location of an individual
on the attribute continuum can be estimated statistically from the dichotomously
scored responses to any subset of the test items. Such an
estimate is a kind of "test score"; the advantage of using latent trait
methods for scoring is that all scores are expressed in the same metric,
regardless of the length or item composition of the test. Thus, within
the limits of the method, automatic equating of different tests can be
effected merely by using latent trait methods for scoring the tests.
This feature makes latent trait test theory an especially appropriate
basis for adaptive testing.

The prevailing trend in application of latent trait methods has
been to scale the measured attribute in such a way that all item characteristic
curves have the same functional form, differing from item
to item only in the parameters of the item characteristic functions.
Thus, once the general functional form has been established, each test
item can be completely characterized and differentiated from other test
items by the parameter(s) of its i.c.c. For attributes such as ability
and achievement variables, where item trace lines should be monotonic
in form, several similar response models have been developed in detail.
These include a one-parameter logistic ogive model due to Rasch (1960),
of which Wright (1968; Wright & Panchapakesan, 1969) has been a leading
proponent in this country; a two-parameter extension of the Rasch
model by Urry (1970); a slightly different two-parameter logistic ogive
model developed by Birnbaum (1968); a similar model based on the normal
ogive, developed by Lord (1952; Lord & Novick, 1968); and a three-parameter
logistic ogive model (Birnbaum, 1968). All of these models
express the probability of a correct (or keyed) response to a dichotomously
scored test item as an ogive function of attribute level. Symbolically,
this may be expressed

P  (1/A)  =  F  (a,b,c;A) .  (1) 

The expression on the left of the equality is the probability of the
keyed (1) response to item g, given A, the attribute level. F(a,b,c;A)
is a general mathematical function of the item parameters a, b, and c
and the person parameter, attribute level A. In the ogive models, F
is an ogive function of the distance (b - A), a scale parameter a, and
an asymptote parameter c.
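
As an illustration of equation 1, the following is a minimal sketch of
one such ogive function, the three-parameter logistic form. The function
name icc_3pl is hypothetical, and the sketch assumes the plain logistic
without any scaling constant; it is not taken from any of the programs
cited in this report.

```python
import math

def icc_3pl(A, a, b, c=0.0):
    """Probability of the keyed response at attribute level A.

    a: scale (discrimination) parameter
    b: location (difficulty) parameter
    c: lower asymptote (guessing) parameter; c = 0 gives the
       two-parameter form, and a common a for all items gives
       the one-parameter form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))

# With no guessing, the probability is .50 where A equals b.
p_at_b = icc_3pl(0.0, a=1.2, b=0.0)

# Far below b, the curve approaches the asymptote c.
p_low = icc_3pl(-10.0, a=1.2, b=0.0, c=0.2)
```

The curve rises monotonically from c toward 1 as A increases, which is
the property required of item trace lines for ability variables.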

Where  more  than  one  item  is  administered,  the  probability  of  any 
pattern  (V) ,  or  vector,  of  item  scores  may  be  calculated  readily  by 
virtue  of  a  local  independence  assumption.  Thus 

$$P(v \mid A) = \prod_{g=1}^{k} \left[P_g(1 \mid A)\right]^{u_g} \left[1 - P_g(1 \mid A)\right]^{1-u_g}. \tag{2}$$


Here P(v/A) is the probability of the pattern of item scores (1's and
0's), given A; u_g is the dichotomous score on item g. From P(v/A) we
may  derive  expressions  for  the  likelihood  of  any  given  attribute  level, 
given  the  item  response  vector.  This  permits  us  to  apply  statistical 
techniques  to  the  estimation  of  A,  if  the  response  pattern,  v,  and  the 
item  parameters  are  known  (or  estimated)  beforehand.  There  are  also 
simple,  nonstatistical  techniques  for  combining  item  responses  into 
other  indices  of  individual  differences  on  the  attribute.  (See  Lord, 
1974a,  for  pertinent  discussion.) 
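
The use of equation 2 for estimation can be sketched as follows. This
is a minimal illustration, assuming a three-parameter logistic response
function and a simple grid search for the maximum of the likelihood; the
function names, the grid limits, and the search method are assumptions
made for the example and are not the algorithms of the programs cited
in this report.

```python
import math

def icc(A, a, b, c):
    """Three-parameter logistic probability of the keyed response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))

def log_likelihood(A, items, responses):
    """Logarithm of equation 2 for one response pattern.

    items: list of (a, b, c) tuples; responses: list of 0/1 scores.
    """
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = icc(A, a, b, c)
        total += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total

def ml_estimate(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum likelihood estimate of A on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda A: log_likelihood(A, items, responses))

items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
estimate = ml_estimate(items, [1, 1, 0])
```

Note that for an all-correct or all-incorrect pattern the likelihood has
no interior maximum, so the grid search simply returns an end point of
the interval.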

Given  that  latent  trait  test  theories  in  principle  can  satisfy 
the  special  requirements  of  adaptive  tests,  it  remains  to  explicate 
such  theories  sufficiently  to  provide  practical  methods  for  estimating 
the parameters of each test item's characteristic curve and for
estimating examinee location on the attribute scale.
State of the Art


Statistical  methods  for  estimating  item  parameters  and  attribute 
levels  have  been  developed  for  all  the  ogive  models  mentioned  above. 
Computer  programs  for  item  parameter  estimation  are  available  (commer¬ 
cially  or  by  private  arrangement)  from  sources  listed  in  Table  1.  Most 
of  these  computer  programs  perform  simultaneous  estimation  of  examinee 
"ability"  and  of  the  item  parameters.  The  statistical  estimation  tech¬ 
niques  used  by  these  programs  range  from  simple  approximations  in  FORTAP 
(Baker  &  Martin,  1969)  to  maximum  likelihood  in  LOGIST  (Wood,  Wingersky, 
&  Lord,  1976),  FORTAP  and  BICAL  (Wright  &  Mead,  1977),  to  Bayesian  model 
estimation  in  OGIVEIA  (Urry,  1976) . 


Table 1

Existing Computer Programs for Estimating Item Parameters
of Latent Trait Item Response Models

Response model            Program name         Available from
------------------------  -------------------  ------------------------------
1-parameter logistic      BICAL                B. Wright, U. of Chicago
  (Rasch model)
2-parameter logistic      LOGOG                R. D. Bock, U. of Chicago
2-parameter normal ogive  FORTAP               F. B. Baker, U. of Wisconsin
                          NORMOG               R. D. Bock, U. of Chicago
3-parameter logistic      LOGIST               F. M. Lord,
                                               Educational Testing Service
3-parameter logistic      OGIVEIA or ANCILLES  V. W. Urry,
                                               Office of Personnel Management

Item  parameter  estimation  procedures  generally  entail  simultaneous 
estimation  of  a  person's  ability.  The  task  of  ability  estimation  (or 
test  scoring)  in  the  context  of  adaptive  testing  is  less  demanding.  All 
item  parameters  have  been  estimated  beforehand;  what  remains  is  to  esti¬ 
mate  ability  (or  to  score  the  tests  in  some  other  appropriate  way)  from 
knowledge  of  the  item  responses  and  the  item  parameters.  The  state  of 
the  art  of  scoring  adaptive  tests  is  outlined  below. 

To summarize, latent trait theories have been shown to provide
appropriate psychometric bases for adaptive testing (see Lord, 1974a;
Urry, 1977). These theories have been well explicated for application
to tests of unidimensional attributes, using dichotomously scored items.
Mathematical algorithms have been developed for scaling attribute
variables and for estimating item characteristic curve parameters and
examinee ability or attribute level. These algorithms have been
incorporated into computer programs that process raw item responses and
yield the desired parameter estimates. These computer programs are
available from their developers.

Generalizations  of  latent  trait  methods  to  measure  unidimensional 
variables  by  means  of  nondichotomous  test  items  have  also  been  accom¬ 
plished.  Samejima  (1969)  presented  methods  for  extending  the  normal 
ogive  response  model  to  graded  response  items.  She  has  since  extended 
it  to  apply  to  items  having  continuous  responses  (Samejima,  1973). 

Bock  (1972)  developed  equations  for  estimating  item  parameters  and  in¬ 
dividual  ability  from  nominal  category  responses  to  polychotomous  test 
items.  Although  they  have  seen  relatively  few  applications,  Samejima's 
and  Bock's  algorithms  have  been  incorporated  into  available  computer 
programs. Using graded, polychotomous, or multinomial-response test
items has potential for appreciable gains in psychometric information
compared to the information in dichotomously scored items.

A further advance in latent trait item response models is the
extension of these models to handle multidimensional test items. Samejima
(1973) has begun work in this area, as has Sympson (1977).


The  Design  of  Adaptive  Tests 


Discussion 


Choosing  an  Adaptive  Testing  Strategy.  An  adaptive  test  is  one 
that  tailors  the  test  constitution  to  examinee  ability  or  attribute 
level;  given  this  definition,  we  are  confronted  with  the  problem  of 
how to accomplish tailoring. This problem of individualized test
design can be brought into conceptual focus by considering that, given a
fixed large set of test items from which only a relatively small subset
is to be administered to an individual examinee, there exists a subset
that is optimal, in some sense, at any specified test length. The
items that constitute the optimal subset will vary as a function of
the individual's attribute level. The problem of adaptive test design
is  that  of  selecting  approximately  optimal  item  subsets  for  each  indi¬ 
vidual  examinee.  Solutions  to  this  problem  are  called  strategies  for 
adaptive  test  design. 

An adaptive testing strategy consists, minimally, of rules for
item selection and for test termination; a scoring procedure may also
be an integral part of some strategies. For comprehensive reviews of
a variety of adaptive test strategies, see Weiss (1974) or Weiss and
Betz (1973).

[Footnote: The term "information" here refers to information in the
sense presented by Birnbaum (1968) and discussed below.]

The  essential  rationale  for  adaptive  item  selection  involves  ad¬ 
ministering  more  difficult  items  following  successful  performance  and 
easier  items  following  less  successful  performance.  If  the  test  is 
item-sequential,  this  translates  to  selecting  a  harder  item  after  a 
correct  item  response,  and  an  easier  item  following  an  incorrect  re¬ 
sponse.  Choosing  the  appropriate  difficulty  increment  is  one  aspect 
of  the  design  problem.  Another  central  aspect  is  choosing  the  cri¬ 
terion  to  be  optimized. 

The  purpose  of  mental  testing  usually  is  to  order  examinees  with 
respect  to  their  relative  attribute  status.  To  achieve  this  purpose, 
it  is  necessary  to  be  able  to  discriminate  accurately  between  any  two 
examinees,  no  matter  how  close  they  are  in  terms  of  the  attribute.  The 
required discriminability has implications for the traditional
difficulty index of the items to be chosen: To discriminate best about a
given point using dichotomous items on which guessing is not a factor,
choose test items for which the probability of a correct response is
.50 at the point in question. If guessing is a factor, the optimal
p-value will exceed .5 by an amount that is a function of the effect of
guessing. However,
if  the  available  test  items  also  differ  with  respect  to  discriminating 
power,  the  latter  also  must  enter  into  the  determination  of  which  item 
discriminates  best  locally.  The  information  function  (Birnbaum,  1968) 
of  a  test  item  provides  a  single  numerical  index  by  which  test  items 
may  be  ordered  with  respect  to  their  usefulness  for  discriminating  at 
a given point. In equation form, the information I in item g at
attribute level A is expressed as


$$I_g(A) = \frac{\left[\frac{\partial}{\partial A} P_g(1 \mid A)\right]^2}{P_g(1 \mid A)\,\left[1 - P_g(1 \mid A)\right]}. \tag{3}$$


That item is "best" for which the local value of Ig(A) is highest. For
a k-item test, the best subset of k items is the subset for which the
sum of the item information values Ig(A) is locally highest. The
implication for adaptive test design is to choose items so as to
maximize Ig(A) at all points A. This maximization is the goal of adaptive
test design. Adaptive testing strategies may or may not explicitly seek
to achieve this goal, and the goal may be realized to a greater or
lesser extent by the different test strategies.
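
The item information function can be made concrete with a short sketch.
It assumes the three-parameter logistic response function, for which the
derivative in equation 3 has a closed form; the function names and the
small item pool are hypothetical and serve only to illustrate the
ordering of items by local information.

```python
import math

def item_information(A, a, b, c=0.0):
    """Equation 3 evaluated for a three-parameter logistic item."""
    L = 1.0 / (1.0 + math.exp(-a * (A - b)))
    P = c + (1.0 - c) * L
    dP = (1.0 - c) * a * L * (1.0 - L)  # derivative of P with respect to A
    return dP * dP / (P * (1.0 - P))

def best_item(A, pool, used=()):
    """Index of the unused item with the highest information at A."""
    candidates = [g for g in range(len(pool)) if g not in used]
    return max(candidates, key=lambda g: item_information(A, *pool[g]))

# Hypothetical pool of (a, b, c) triples.
pool = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.5, 0.0)]
choice_at_zero = best_item(0.0, pool)
choice_at_high = best_item(1.5, pool)
```

With no guessing, an item's information peaks where A equals b, at a
value of a squared divided by 4, so the more discriminating item is
preferred when it is located near the point of interest. With c greater
than zero the peak shifts above b, in keeping with the remark that the
optimal p-value then exceeds .5.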


[Footnote: Analogous to the item information function are two others,
the test information function and the test score information function,
both of which index measurement precision as a function of attribute
level.]




Adaptive  test  strategies  differ  in  a  number  of  ways.  One  general 
dimension  of  these  differences  is  their  item  selection  mode.  Some 
strategies arrange test items a priori by difficulty and discrimination
into a logical structure, such as a one- or two-dimensional matrix,
and select items sequentially according to examinee performance by
branching to a predetermined location in the structure and administering
the item(s) that reside in that location. Such strategies may be
called "mechanical" by virtue of their almost mechanical rules for item
selection. Examples of mechanical strategies include the simple
branching strategies; the stair-step or pyramidal method used by Bayroff
and Seeley (1967) and by Larkin and Weiss (1974) and described by Lord
(1974a); the flexilevel tailored test devised by Lord (1971a); the
simple two-stage strategy, investigated by Lord (1971b) and by Betz
and Weiss (1974); the stratified adaptive (STRADAPTIVE) procedure
proposed by Weiss (1973); and even the Robbins-Monro procedures described
by Lord (1971c; 1974a).
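
The branching rule of a mechanical strategy can be sketched in a few
lines. The example below is a simplified pyramidal (stair-step)
structure with a constant difficulty step; the function name, the step
size, and the deterministic examinee are assumptions made for
illustration and do not reproduce any of the cited procedures exactly.

```python
def pyramidal_path(answer, n_stages, step=0.5, start=0.0):
    """Trace one examinee through a pyramidal item structure.

    Stage s holds s + 1 items. From position j at stage s, a correct
    response branches to position j + 1 at the next stage (a harder
    item) and an incorrect response to position j (an easier item).
    answer(difficulty) returns 1 for a correct response, else 0.
    """
    j = 0
    path = []
    for s in range(n_stages):
        difficulty = start + step * (2 * j - s)
        u = answer(difficulty)
        path.append((difficulty, u))
        j += u
    return path

# Hypothetical examinee who passes every item of difficulty 1.0 or less.
path = pyramidal_path(lambda d: 1 if d <= 1.0 else 0, n_stages=6)
difficulties = [d for d, _ in path]
```

Only two items are eligible at each stage, and the sequence of
difficulties climbs until it oscillates about the level at which the
examinee begins to fail.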

Distinguished  from  the  mechanical,  or  branching,  strategies  are 
adaptive  strategies  that  use  mathematical  criteria  for  item  selection. 
Such  strategies  typically  estimate  the  examinee's  latent  attribute 
status after each item response, then choose the available item for
which some mathematical function of that estimate and of the item
parameters is maximized or minimized. Examples of mathematical strategies
include  Owen's  (1969,  1975)  Bayesian  sequential  procedure,  in  which  a 
quadratic  loss  function  is  minimized;  and  Lord's  (1977)  maximum  like¬ 
lihood  strategy  in  which  the  available  item  with  the  largest  local  in¬ 
formation  function  is  chosen. 
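
The general shape of a mathematical strategy can be sketched as a loop
that re-estimates the attribute after each response and then selects the
remaining item with the largest information at that estimate. The sketch
below is an assumption-laden illustration: it uses a posterior mean
computed on a grid under a standard normal prior as the interim
estimate, which is neither Owen's closed-form Bayesian update nor Lord's
maximum likelihood procedure, and all names in it are hypothetical.

```python
import math

GRID = [-4.0 + 0.05 * i for i in range(161)]  # attribute levels -4 to 4

def icc(A, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))

def info(A, a, b, c):
    L = 1.0 / (1.0 + math.exp(-a * (A - b)))
    P = c + (1.0 - c) * L
    return ((1.0 - c) * a * L * (1.0 - L)) ** 2 / (P * (1.0 - P))

def posterior_mean(pool, administered, responses):
    """Interim estimate: posterior mean under a N(0, 1) prior (assumed)."""
    weights = []
    for A in GRID:
        w = math.exp(-0.5 * A * A)
        for g, u in zip(administered, responses):
            p = icc(A, *pool[g])
            w *= p if u else 1.0 - p
        weights.append(w)
    return sum(A * w for A, w in zip(GRID, weights)) / sum(weights)

def adaptive_test(pool, answer, n_items):
    """Fixed-length test; every unused item is eligible at each stage."""
    administered, responses = [], []
    estimate = 0.0
    for _ in range(n_items):
        remaining = [g for g in range(len(pool)) if g not in administered]
        g = max(remaining, key=lambda i: info(estimate, *pool[i]))
        administered.append(g)
        responses.append(answer(g))  # answer(g) returns 0 or 1
        estimate = posterior_mean(pool, administered, responses)
    return administered, responses, estimate

# Hypothetical pool: difficulties from -3 to 3 in steps of 0.5.
pool = [(1.0, x / 2.0, 0.0) for x in range(-6, 7)]
administered, responses, estimate = adaptive_test(pool, lambda g: 1, 5)
```

In contrast to the branching strategies, the eligible set at each stage
here is the whole unadministered pool, and the selection depends on the
numerical estimate rather than on a predetermined structure.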

One  of  the  clearest  distinctions  between  mechanical  and  mathemati¬ 
cal  strategies  is  that  in  the  latter  every  unadministered  test  item  is 
potentially  eligible  for  selection  at  any  stage  in  the  test,  whereas  in 
a  mechanical  strategy  only  a  small  number  of  items — as  few  as  two — are 
eligible  for  selection  at  any  given  stage.  Another  obvious  distinction 
is  that  the  mathematical  strategies  are  appealing  by  virtue  of  their 
elegance,  whereas  the  virtue  of  the  mechanical  strategies  is  their  sim¬ 
plicity.  In  confronting  the  problem  of  choosing  an  adaptive  strategy, 
one  first  must  choose  between  elegance  and  simplicity.  Then,  by  elect¬ 
ing  categorically  either  a  mechanical  or  mathematical  strategy,  one 
is  faced  with  the  further  choice  of  a  specific  adaptive  testing  strate¬ 
gy.  The  number  of  strategies  proposed  for  use  has  proliferated  faster 
than  have  research  results  useful  to  guide  the  choice. 

The  Test  Length  Issue.  Confounded  with  the  problem  of  choosing 
a  testing  strategy  is  the  problem  of  test  length.  Like  conventional 
tests,  adaptive  tests  may  be  short  or  long;  unlike  most  conventional 
tests,  adaptive  tests  may  adapt  test  length,  as  well  as  test  design, 
to  the  individual. 

The notion of a variable-length test seems to make sense, since the
examiner can administer as few or as many items as necessary to measure
each individual with a specified degree of precision. Furthermore, it
is apparent that if measurement precision is to be held constant,
achieving that precision should require relatively few items for persons
whose attribute level is near the central tendency of the population,
and more items for persons located in the upper and lower extremes of
the attribute continuum. Roughly speaking, if precision is to be held
constant, the required adaptive test length should be a U-shaped
function of attribute level.
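
One way such a U-shaped relationship could arise is sketched below. The
example assumes a hypothetical pool in which item difficulties are
concentrated near the middle of the scale, with no guessing, and counts
how many of the most informative items are needed to reach a fixed
amount of information at several attribute levels. The pool, the target
value, and the function names are illustrative assumptions only.

```python
import math

def item_information(A, a, b):
    """Information of a two-parameter logistic item at level A."""
    L = 1.0 / (1.0 + math.exp(-a * (A - b)))
    return a * a * L * (1.0 - L)

def items_needed(A, pool, target_information):
    """Fewest items, taken in order of information at A, whose summed
    information reaches the target; None if the pool cannot reach it."""
    infos = sorted((item_information(A, a, b) for a, b in pool),
                   reverse=True)
    total = 0.0
    for n, value in enumerate(infos, start=1):
        total += value
        if total >= target_information:
            return n
    return None

# Hypothetical pool: nine items packed between -1 and 1, eight spread
# thinly over the extremes.
center = [(1.5, x / 4.0) for x in range(-4, 5)]
tails = [(1.5, b) for b in (-3.0, -2.5, -2.0, -1.5, 1.5, 2.0, 2.5, 3.0)]
POOL = center + tails

lengths = {A: items_needed(A, POOL, 3.0) for A in (-2.0, 0.0, 2.0)}
```

Under these assumptions fewer items are required at the center of the
scale than at either extreme, because well-matched items are scarcer
there.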

Among the proponents of variable length adaptive tests are Samejima
(1977), Urry (1974, 1977a), and Weiss (1973). Weiss advocates the use of
a simple stopping rule based on identifying a "ceiling level" of
difficulty for each examinee in conjunction with the stratified adaptive
(STRADAPTIVE) strategy. Samejima (1977) proposed that test length be
varied such that a constant level of measurement precision (indexed by
the test information function) be achieved throughout a prespecified
range on the attribute scale. Urry (1974) espouses using variable test
length in conjunction with Owen's Bayesian sequential adaptive strategy
in such a way as to yield a prespecified level of the validity of the
test scores as a measure of the underlying attribute; the squared
validity may be interpreted as a reliability coefficient.

It  should  be  pointed  out  that  some  adaptive  testing  strategies  are 
inherently  fixed-length.  Among  these  are  the  flexilevel,  pyramidal,  and 
two-stage  strategies.  Others,  like  Weiss'  and  Owen's  strategies,  make 
fixed-length  optional.  The  variable-length  test  termination  criteria 
espoused by Urry and Samejima can in principle be used with any adaptive
strategy, even the ones described above as inherently fixed-length.

Weiss' criterion for variable-length termination of the STRADAPTIVE
test, however, is somewhat restricted in applicability because it
requires a certain structure (stratification by difficulty) of the item
pool.


Given  the  intuitive  appeal  of  variable  test  length,  two  problems 
remain.  One  problem  is  to  decide  between  variable  versus  fixed  test 
length  and  which  of  the  available  test  termination  criteria  to  adopt. 
The  other  problem  is  to  verify  that  the  apparent  advantages  of  variable 
test  length  are  realized  in  practice. 


State  of  the  Art 


Choosing an Adaptive Strategy. One of the first steps in
implementing a program of adaptive testing must be to choose an adaptive
testing strategy from among those available. This choice should be
an informed one, based on the results of research comparing the merits
of available methods. Very little research has been conducted along
these lines, however. Instead, most adaptive testing research has
concentrated on comparing the psychometric properties of specific
adaptive test strategies against the properties of otherwise comparable
conventional test designs. Weiss and Betz (1973) reviewed the results of
these comparisons.

[Footnote: By "validity" is meant the correlation between the test
score (ability estimate) and the underlying true ability. This
correlation is estimated from the Bayesian posterior variance under
Owen's method following each item response by an examinee.]

Some live-testing research comparing adaptive strategies was
reported by Larkin and Weiss (1975). Only two strategies were compared,
however,  and  the  results  were  equivocal.  The  only  other  data  available 
as  a  basis  for  comparing  adaptive  strategies  are  data  resulting  from 
analytic  studies  of  the  properties  of  various  strategies  and  from  model¬ 
sampling  computer  simulation  studies  of  similar  properties.  Lord  (1970; 
1971a,  b,  c)  reported  the  results  of  analytic  studies  of  several  adap¬ 
tive  strategies,  but  made  no  effort  to  compare  them.  The  only  studies 
that  directly  compared  several  strategies  were  the  simulation  studies 
of  Vale  (1975)  and  McBride  (1976b). 

Vale’s  study  compared  five  leading  strategies  in  terms  of  the  level 
and  shape  of  the  resulting  test  information  functions;  in  other  words, 
in  terms  of  relative  measurement  precision  as  a  function  of  attribute 
level.  Vale's  artificial  data  were  based  on  a  response  model  that  did 
not  permit  guessing.  Further,  he  presented  data  only  for  24-item  fixed- 
length  tests.  His  results  indicated  that  under  the  conditions  simulated, 
the Bayesian test strategy was superior in terms of the level of
measurement precision, whereas the stradaptive strategy was superior in
terms of measuring with constant precision at all levels of the
attribute. The other adaptive strategies compared (the flexilevel,
pyramidal, and two-stage strategies) all were inferior to the first two
in some way.

Vale's  study  simulated  only  the  no-guessing  situation  and  a  single 
test  length  and  did  not  investigate  mathematical  strategies  other  than 
the  Bayesian  one.  McBride  (1976b)  extended  Vale's  results  in  a  series 
of  simulation  studies  comparing  the  psychometric  properties  of  two 
mathematical  and  two  leading  mechanical  strategies  at  six  different 
test  lengths  and  under  several  realistic  conditions,  including  the 
presence  of  guessing.  His  results  indicated  that  the  two  mathematical 
strategies  were  generally  superior  to  the  mechanical  ones,  especially 
at short test lengths (5 to 15 items), both in terms of test fidelity
(validity)  and  measurement  precision.  At  moderate  test  lengths  (20  to 
30  items),  the  mathematical  strategies  were  still  superior,  but  their 
advantages  over  the  mechanical  strategies  were  slight. 

The  two  mathematical  strategies  were  Owen's  Bayesian  sequential 
one,  and  a  variant  of  a  maximum  likelihood  strategy  proposed  by  Lord 
(1977).  Differences  in  results  between  the  two  were  slight,  but  the 
maximum  likelihood  strategy  was  judged  superior  in  adaptive  efficiency — 
the  degree  to  which  the  methods  select  the  optimal  subset  of  items  at 
a  given  test  length — and  also  in  several  other  respects. 




McBride  concluded  that  his  data  favored  the  maximum  likelihood 
strategy  overall,  but  that  the  choice  among  the  four  strategies  should 
be  influenced  by  other  considerations.  For  example,  the  Bayesian 
strategy  was  the  best  of  the  four,  in  terms  of  adaptive  efficiency, 
at  very  short  test  length  (5  items)  when  all  examinees  began  the  test 
at  the  same  level  of  difficulty;  at  the  longer  test  lengths  (25  and 
30  items),  all  four  strategies  had  excellent  measurement  properties, 
and  any  one  of  them  could  reasonably  be  chosen. 

It  is  important  to  note  that  McBride's  comparison  studies  were 
carried  out  so  that  the  correct  test  item  parameters  were  known  and 
available  when  simulating  each  adaptive  strategy.  In  live  testing,  of 
course,  only  fallible  estimates  of  the  parameters  of  the  item  charac¬ 
teristic  curves  are  available.  The  use  of  fallible  estimates  should 
introduce  measurement  errors  over  and  above  those  entering  into  McBride's 
data.  It  is  possible  that  the  effects  of  such  errors  could  alter  some 
of  the  conclusions  McBride  reached  concerning  the  order  of  merit  of  the 
four  strategies  he  evaluated.  Research  is  needed  extending  his  findings 
to  the  case  of  fallibly  estimated  item  parameters. 

Vale's  (1975)  and  McBride's  (1976b)  simulation  studies  are  the 
only  ones  available  for  comparing  strategies.  There  is,  however,  a 
sizable  body  of  research  results  available  for  evaluating  several  in¬ 
dividual  adaptive  strategies  against  conventional  tests.  Urry  and  his 
associates  (Urry,  1971,  1974,  1977b;  Jensema,  1972,  1974,  1977;  Schmidt 
&  Gugel,  1975)  have  reported  results  of  a  comprehensive  program  of  com¬ 
puter  simulation  investigations  of  some  psychometric  properties  of 
Owen's  Bayesian  sequential  adaptive  test.  Vale  and  Weiss  (1975)  re¬ 
port  in  considerable  detail  the  measurement  properties  of  the  stradap- 
tive  strategy.  Lord  (1977)  recently  proposed  the  broad-range  tailored 
test  (a  maximum  likelihood  strategy)  and  reported  some  data  relevant  to 
its psychometric properties. All of these investigations have utilized
model-sampling computer simulation methods to explore the behavior of
the  various  test  strategies.  All  have  also  taken  different  lines  of 
approach  and  concentrated  on  different  aspects  of  each  strategy's  psy¬ 
chometric  behavior,  so  that  it  is  not  possible  to  compare  the  strate¬ 
gies  on  the  basis  of  the  available  reported  data. 

Fixed-Length  Versus  Variable-Length  Adaptive  Tests.  There  has 
been  no  systematic  study  of  the  relative  merits  of  variable-length 
versus  fixed-length  adaptive  tests.  Rather,  researchers  in  this  area 
have  tended  to  make  an  a  priori  choice  between  the  two  options  and 
leave  the  choice  unquestioned.  Working  independently  and  motivated 
by  different  considerations,  Samejima  (1976),  Urry  (1974),  and  Weiss 
(1973)  all  chose  in  favor  of  variable  length.  Lord  (1977),  however, 
opted  for  fixed  length  in  proposing  his  broad-range  tailored  test. 

Samejima  (1977),  working  in  the  framework  of  a  maximum  likelihood 
strategy,  suggested  that  the  test  information  function  be  estimated 
for  each  individual  after  each  item  response.  The  test  may  be  termi¬ 
nated  when  the  estimated  value  of  the  information  function  reaches 
a prespecified level. The effect of using this test termination rule
would be to achieve a virtually horizontal test information function
throughout  a  wide  interval  of  the  attribute  continuum.  This  is  tanta¬ 
mount  to  using  the  test  termination  rule  to  guarantee  equiprecision  of 
measurement  over  a  specified  range,  which  is  one  of  the  principal  moti¬ 
vations  behind  adaptive  testing.  No  data  are  available  to  indicate 
whether  Samejima's  test  termination  criterion  would  actually  achieve 
its  purpose. 

Urry and others (Urry, 1974; Jensema, 1977; Schmidt & Gugel, 1975)
favor  variable  test  length  for  use  with  Owen's  Bayesian  sequential  adap¬ 
tive  test  strategy.  Under  Owen's  procedure,  the  posterior  variance  of 
the  distribution  of  the  Bayes  estimator  is  calculated  following  each 
test  item  response;  that  variance,  which  usually  diminishes  after  each 
item,  is  interpreted  by  Urry  as  the  square  of  the  standard  error  of 
estimate  (s.e.m.)  of  the  examinee's  attribute  level.  Thus,  by  termi¬ 
nating  each  test  when  the  calculated  standard  error  reaches  a  prespeci¬ 
fied  small  value,  the  standard  error  of  estimation  in  the  examinee 
group  can  be  controlled  and  consequently  so  can  an  index  of  reliability 
of  the  ability  estimates.  Thus,  Urry  advocates  a  variable  length  test 
termination  rule  to  ensure  (approximately)  that  the  adaptive  test  scores 
have a prespecified level of correlation with the latent attribute being
measured.
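
A termination rule of this general kind can be sketched as follows. The
example is only an approximation of the idea: it computes the posterior
mean and variance on a grid under a standard normal prior rather than by
Owen's formulas, it selects the unused item whose difficulty is closest
to the current estimate rather than by Owen's loss criterion, and the
conversion from posterior standard deviation to a fidelity coefficient
assumes a prior variance of 1. All function names are hypothetical.

```python
import math

GRID = [-4.0 + 0.05 * i for i in range(161)]

def icc(A, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (A - b)))

def posterior_moments(pool, administered, responses):
    """Posterior mean and variance of A under a N(0, 1) prior (assumed)."""
    weights = []
    for A in GRID:
        w = math.exp(-0.5 * A * A)
        for g, u in zip(administered, responses):
            p = icc(A, *pool[g])
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(A * w for A, w in zip(GRID, weights)) / total
    var = sum((A - mean) ** 2 * w for A, w in zip(GRID, weights)) / total
    return mean, var

def variable_length_test(pool, answer, target_se, max_items):
    """Administer items until the posterior s.d. reaches target_se."""
    administered, responses = [], []
    mean, var = posterior_moments(pool, administered, responses)
    limit = min(max_items, len(pool))
    while math.sqrt(var) > target_se and len(administered) < limit:
        remaining = [g for g in range(len(pool)) if g not in administered]
        # Simplified selection rule: difficulty nearest the estimate.
        g = min(remaining, key=lambda i: abs(pool[i][1] - mean))
        administered.append(g)
        responses.append(answer(g))  # answer(g) returns 0 or 1
        mean, var = posterior_moments(pool, administered, responses)
    return mean, math.sqrt(var), len(administered)

def implied_fidelity(posterior_sd):
    """Correlation with the attribute implied by the posterior s.d.,
    assuming a unit-variance prior (an assumption of this sketch)."""
    return math.sqrt(max(0.0, 1.0 - posterior_sd ** 2))
```

A smaller target value lengthens the test and raises the implied
fidelity. The maximum length guards against examinees for whom the
target cannot be reached with the available items.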

Urry  (1971,  1974)  and  Jensema  (1977)  have  presented  the  results 
of  numerous  simulation  studies  of  Owen's  procedure  to  show  that  the 
fidelity  coefficient  of  the  test  scores  can  be  controlled  by  using  the 
posterior  variance  as  a  test  termination  rule.  These  studies  all  used 
the  true  values  of  the  simulated  test  items'  parameters  for  item  selec¬ 
tion  and  scoring.  Schmidt  and  Gugel  (1975)  presented  simulation  study 
data  for  the  more  veridical  case  in  which  fallible  item  parameter  esti¬ 
mates  are  used.  The  effect  of  using  fallible  item  parameters  with 
Owen's  procedure  was  a  tendency  for  the  tests  to  terminate  prematurely, 
with  the  result  that  the  obtained  fidelity  coefficients  fell  slightly 
short  of  the  targeted  values. 

Subsequent  to  his  computer  simulation  studies,  Urry  (1977b)  ad¬ 
ministered  Bayesian  adaptive  tests  of  verbal  ability  to  live  examinees. 
His  analysis  of  the  adaptive  test  data  evaluated  the  usefulness  of  the 
s.e.m.  test  termination  criterion  for  controlling  the  level  of  "con¬ 
struct  validity" — correlation  of  the  resulting  test  scores  with  an 
independent  measure  of  the  same  ability.  Urry  found  that  for  all  the 
evaluated  levels  of  the  s.e.m.  criterion,  the  obtained  validity  coef¬ 
ficient  was  equal  to  or  slightly  greater  than  the  forecast  validity 
associated  with  each  test  termination  criterion.  He  concluded  that 
the  theory  was  supported;  that  it  was  possible  to  control  the  relia¬ 
bility  and  validity  of  a  test  by  using  the  Bayesian  procedure  and 
manipulating  the  posterior  variance  termination  criterion. 

Urry  and  others  have  been  successful  in  controlling  adaptive  test 
validity/fidelity/reliability  by  manipulating  test  termination  criteria. 
That success notwithstanding, they have not demonstrated that
equiprecision of measurement (a flat information function) could be
achieved using their proposed variable test length procedure, nor have
they attempted to do so. McBride (1977), in simulation studies of the same
Bayesian  procedure,  found  a  strong  positive  correlation  (.8)  between 
test  length  and  ability  when  variable  test-length  was  used;  i.e.,  the 
termination  criterion  was  satisfied  in  fewer  items  for  lower  ability 
examinees.  His  data  indicated  that  the  relationship  between  attribute 
level  and  test  length  is  not  U-shaped,  as  it  should  be  to  approximate 
a horizontal information function. As a result, the information
functions of the simulated Bayesian variable-length tests tended to be
convex in shape, with markedly low values in the low end of the
attribute range. McBride concluded that there may be greater virtue in
fixed-length  Bayesian  adaptive  tests. 

It  should  be  clear  by  now  that  some  issues  involved  in  choosing 
an  adaptive  testing  strategy  and  in  deciding  between  fixed  and  variable 
test length remain unresolved. Additional research in both areas is
needed.


These  unresolved  issues  need  not  impede  progress  in  the  experi¬ 
mental  implementation  of  systems  for  adaptive  testing,  because  the  un¬ 
known  differences  among  the  leading  adaptive  testing  strategies  are 
undoubtedly  of  lesser  magnitude  than  the  difference  between  any  such 
strategy  and  a  conventional  test  design.  Perhaps  recognizing  this, 
Urry  (1977a)  cautions  sternly  against  procrastination  in  implementing 
adaptive  testing.  It  may  be  wiser  to  proceed  by  making  a  tentative 
choice  among  the  strategies  and  an  arbitrary  decision  on  the  test 
length  issue,  letting  the  academic  world  settle  the  remaining  basic 
research  issues  in  due  course. 


Scoring  Adaptive  Tests 


Discussion 


For  most  adaptive  test  strategies,  the  traditional  number  correct 
or  proportion  correct  score  will  not  suffice  to  index  individual  dif¬ 
ferences  on  the  attribute  being  measured.  To  understand  this,  consider 
the  goal  of  adaptive  testing:  to  achieve  equiprecision  of  measurement 
across  a  wide  range.  The  goal  is  achieved  by  fitting  the  test  to  the 
examinee.  Other  things  being  equal,  accomplishing  that  fit  will  result 
in  a  flat  regression  of  the  proportion  correct  score  on  the  attribute 
scale.  That  is,  test  difficulty  (as  indexed  by  mean  proportion  correct) 
will  be  approximately  equal  across  a  wide  range  of  the  attribute.  As 
a result, the proportion correct scores will have an information
function whose value is near zero throughout that wide range (e.g.,
McBride, 1975).


In  practice,  adaptive  tests  can  be  expected  to  fall  somewhat  short 
of  the  goal  of  equiprecision,  so  that  there  may  be  some  information  in 
traditional  scoring  methods.  Nonetheless,  for  the  most  part  the  propor¬ 
tion  correct  and  similar  indices  are  not  adequate  as  general  scoring 
procedures  for  adaptive  tests. 
An immediate exception is Lord's (1971a) flexilevel test strategy,
which  was  specifically  designed  so  that  the  number  correct  score  would 
be  a  meaningful  index.  The  flexilevel  strategy  aside,  let  us  consider 
the  requirements  and  desiderata  of  an  adaptive  test  scoring  procedure. 
In  an  adaptive  test,  different  persons  take  different  sets  of  test 
items.  These  items  vary  in  difficulty  and  may  also  vary  in  their  dis¬ 
criminating  powers  and  susceptibility  to  guessing.  Further,  under 
some  adaptive  strategies,  test  length  may  vary  from  one  person  to  an¬ 
other,  as  may  the  difficulty  level  at  which  the  test  was  begun.  There 
is  useful  information  in  all  the  parameters  just  mentioned,  so  that  a 
scoring  method  needs  to  account  not  only  for  how  many  items  a  person 
answers  correctly,  but  also  which  items  were  answered,  and  in  some 
cases  which  ones  were  answered  correctly  or  incorrectly.  It  is  de¬ 
sirable  for  the  scoring  procedure  to  make  use  of  all  the  information 
contained  in  the  examinee's  answers,  as  well  as  in  the  identity  of  the 
items  constituting  the  test. 

Scoring  methods  based  on  latent  trait  theory  are  especially  useful 
and appropriate for scoring adaptive tests. This is because such
methods can take into account all relevant data in the constitution of an
individual test (such as test length and item characteristic curve
parameters) as well as the item-by-item performance of the examinee.
Some  fairly  simple  methods  are  available,  along  with  others  so  complex 
that  they  require  a  computer  to  perform  needed  calculations.  The  prob¬ 
lem  of  scoring  adaptive  tests  is  the  problem  of  choosing  (or  devising) 
an  appropriate  scoring  method.  Some  of  the  available  methods  are  dis¬ 
cussed  below. 


State  of  the  Art 

The  number  of  scoring  methods  available  for  adaptive  tests  is  siz¬ 
able.  Some  methods  are  general  and  are  applicable  under  a  variety  of 
testing strategies, while others were devised ad hoc and are specific
to  one  or  a  few  strategies.  Among  the  general  methods  we  can  distin¬ 
guish  statistical  procedures  from  nonstatistical  ones. 

Statistical  Scoring  Procedures.  These  procedures  are  based  on 
techniques  of  combining  known  psychometric  information  about  the  test 
items  with  the  observed  item  response  performance  of  the  examinee  in 
such  a  way  as  to  yield  a  statistical  estimate  of  the  examinee's  loca¬ 
tion  on  the  attribute  scale.  Although  there  are  a  host  of  such  esti¬ 
mation  methods  available,  the  ones  most  prominent  in  the  literature 
have  been  estimators  based  on  the  Rasch  one-parameter  logistic  ogive 
item  response  model,  on  the  Birnbaum  three-parameter  logistic  ogive 
model,  and  on  the  three-parameter  normal  ogive  model. 


Under the Rasch model, the number correct score is a sufficient
statistic for the estimation procedure,¹ provided that the Rasch
difficulty parameters of the items constituting an individual test are
known, there is no guessing, and all items are equidiscriminating.
Least squares estimators and maximum likelihood estimators of attribute
level have been derived and published (e.g., Wright & Panchapakesan,
1969).
The  maximum  likelihood  estimator  is  somewhat  more  elegant  and  more  ac¬ 
curate.  Estimators  based  on  the  Rasch  model  are  not  strictly  appropri¬ 
ate  for  scoring  tests  having  known  differences  in  item  discrimination 
parameters,  or  on  which  there  is  a  substantial  chance  of  answering 
questions  correctly  by  guessing.  Urry  (1970)  has  evaluated  the  effects 
of  ignoring  guessing  and  item  discriminating  powers  in  scoring  adaptive 
tests;  the  result  is  some  loss  of  accuracy  in  ordering  individual  dif¬ 
ferences.  That  loss  is  reflected  in  the  validity  of  the  adaptive  test 
scores  for  measuring  the  relevant  attribute.  In  sum,  where  the  Rasch 
model  is  appropriate,  its  use  for  scoring  adaptive  tests  is  not  ques¬ 
tioned.  Where  it  is  inappropriate,  a  scoring  procedure  based  on  a  more 
general  response  model  will  extract  more  useful  information  from  adap¬ 
tive  test  response  protocols. 
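
As an illustration of Rasch-based scoring, the following sketch solves the maximum likelihood equation for one examinee by Newton-Raphson iteration. It is a minimal example written for this discussion, not the published procedure cited above; it assumes the item difficulties are known, that discrimination is fixed at 1, and that there is no guessing. The function names and the step limit are hypothetical choices.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct answer to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_mle(number_correct, difficulties, tol=1e-10, max_iter=100):
    """Maximum likelihood attribute estimate from the number correct score."""
    n = len(difficulties)
    if not 0 < number_correct < n:
        raise ValueError("no finite estimate exists for a zero or perfect score")
    theta = sum(difficulties) / n              # start at the mean item difficulty
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        score = number_correct - sum(probs)    # first derivative of the log likelihood
        info = sum(p * (1.0 - p) for p in probs)  # minus the second derivative
        step = max(-1.0, min(1.0, score / info))  # limit the step for stability
        theta += step
        if abs(step) < tol:
            break
    return theta

# Same number correct on two different sets of items (hypothetical difficulties).
easy_items = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_items = [0.0, 0.5, 1.0, 1.5, 2.0]
easy_estimate = rasch_mle(3, easy_items)
hard_estimate = rasch_mle(3, hard_items)
print(easy_estimate, hard_estimate)
```

The two examinees answer the same number of items correctly, yet the estimates differ because the item difficulties differ. The number correct is only a sufficient statistic; the estimate that results from it is the score.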

The  more  general  item  response  models  include  two-  and  three- 
parameter  normal  and  logistic  ogive  models.  The  logistic  models  can 
readily  be  made  to  approximate  closely  the  normal  models.  Because  of 
their mathematical tractability, the logistic ogive models have largely
supplanted the normal ogive models in use. Further, the three-parameter
models are the most general: the two-parameter models are special cases
of them, and the Rasch model is in turn a special case of the
three-parameter logistic model. Thus, the three-parameter logistic model is the model
predominantly  used  in  current  practice. 
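
The relationships among these models can be sketched in a few lines. The code below is illustrative only: the scaling constant D = 1.702 is an assumption (a value conventionally used to bring the logistic curve close to the normal ogive), and the Rasch case is written with unit discrimination and D = 1, which is one common parameterization rather than the only one.

```python
import math

def logistic_3p(theta, a, b, c, D=1.702):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def normal_ogive_3p(theta, a, b, c):
    """Three-parameter normal ogive item characteristic curve."""
    phi = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi

def logistic_2p(theta, a, b):
    """Two-parameter logistic: the three-parameter model with no guessing."""
    return logistic_3p(theta, a, b, 0.0)

def rasch(theta, b):
    """Rasch model: no guessing and a common (here unit) discrimination."""
    return logistic_3p(theta, 1.0, b, 0.0, D=1.0)

# Largest gap between the scaled logistic and the normal ogive on a fine grid.
grid = [x / 100.0 for x in range(-400, 401)]
max_gap = max(
    abs(logistic_3p(t, 1.0, 0.0, 0.0) - normal_ogive_3p(t, 1.0, 0.0, 0.0))
    for t in grid
)
print(max_gap)
```

With this scaling the two curves differ by less than one percentage point of probability anywhere on the grid, which is the sense in which the logistic models approximate the normal ones.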

Test  scoring  (attribute  estimation)  under  the  three-parameter 
logistic  model  usually  has  been  accomplished  using  iterative  maximum 
likelihood  estimation  procedures.  Such  procedures  use  all  the  infor¬ 
mation  available  in  an  examinee's  dichotomous  item  scores  on  an  adap¬ 
tive  (or  conventional)  test:  item  difficulty,  discrimination,  and 
guessing  parameters;  and  the  pattern  of  the  examinee's  right  and  wrong 
answers.  The  likelihood  equations  used  for  this  scoring  method  have 
been  derived  and  published  (e.g.,  Jensema,  1972).  Algorithms  for  per¬ 
forming  the  estimation  procedure  have  been  incorporated  in  several 
computer  programs  (e.g.,  see  Urry,  1970;  McBride,  1976b;  Wood,  Winger- 
sky,  &  Lord,  1976;  Bejar  &  Weiss,  1979). 
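
A minimal sketch of iterative maximum likelihood scoring under the three-parameter logistic model follows. It is not a reproduction of any of the programs cited above. It uses Fisher scoring (expected rather than observed information) with a limited step size and a bounded search interval, both of which are assumptions made here for numerical stability; the scaling constant D = 1.7 and the item parameters are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant

def p3(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log likelihood of a pattern of right (1) and wrong (0) answers."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p3(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total

def mle_3pl(items, responses, start=0.0, bound=6.0, tol=1e-8, max_iter=200):
    """Fisher scoring estimate of attribute level, kept within [-bound, bound]."""
    theta = start
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for (a, b, c), u in zip(items, responses):
            p = p3(theta, a, b, c)
            slope = D * a * (p - c) * (1.0 - p) / (1.0 - c)  # dP/dtheta
            score += (u - p) * slope / (p * (1.0 - p))
            info += slope * slope / (p * (1.0 - p))
        step = max(-1.0, min(1.0, score / info))
        theta = max(-bound, min(bound, theta + step))
        if abs(step) < tol:
            break
    return theta

# Hypothetical five-item test: (discrimination, difficulty, guessing) per item.
items = [(1.0, -2.0, 0.2), (1.0, -1.0, 0.2), (1.0, 0.0, 0.2),
         (1.0, 1.0, 0.2), (1.0, 2.0, 0.2)]
responses = [1, 1, 1, 0, 0]
theta_hat = mle_3pl(items, responses)
print(theta_hat)
```

For an all-correct or all-incorrect pattern no finite maximum exists, and this sketch simply returns the boundary value; that is one form of the convergence problem associated with maximum likelihood scoring.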

Methods other than maximum likelihood may also be used for the
statistical estimation of attribute scale location. Sympson (1976),
for example, recently described two alternative methods, including a
generalized Bayesian one, for estimation under the three-parameter
logistic model.

¹ For scoring an adaptive test using the Rasch model, the number correct
is not admissible as a test score, but rather as a sufficient statistic
for estimating ability; the resulting estimate is the test score.

There  is  one  prominent  application  of  the  three-parameter  normal 
ogive  response  model  to  estimating  examinee  location  on  the  attribute 
scale,  a  Bayesian  sequential  procedure  given  by  Owen  (1969,  1975). 

Owen's  estimation  technique  was  presented  as  an  integral  part  of  his 
sequential  adaptive  testing  strategy.  It  is  just  as  appropriate  for 
use  as  a  scoring  procedure  for  any  test  where  item  parameters  and  di¬ 
chotomous  item  scores  are  available. 

Both  the  maximum  likelihood  procedure  and  Owen's  Bayesian  sequen¬ 
tial  procedure  are  methods  of  estimating  an  examinee’s  location  on  a 
continuum.  There  are  substantial  differences  in  approach  between  the 
two,  however.  The  maximum  likelihood  procedure  estimates  the  examinee 
location  parameter  from  the  pattern  of  an  examinee's  right  and  wrong 
answers  to  his  or  her  test  questions,  by  solving  a  likelihood  equation. 
No  prior  assumptions  are  involved  regarding  the  examinee's  location 
or  the  distribution  of  the  attribute. 

Owen's Bayesian procedure estimates examinee location sequentially.
It begins with an initial estimate of the location parameter and
updates that estimate, one item at a time, by solving equations that
consider both the likelihood function of the single item score and the
density function of an assumed normal distribution. The ability
estimate is the final updated value after the last item score is considered.

Because it is a sequential procedure, Owen's scoring method is
order-dependent. Analyzing the same item responses in different orders
can result in slightly different numerical values of the final
estimates. The maximum likelihood scoring procedure is not dependent on
the order in which items are administered (or item responses analyzed).
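
The sequential logic and its order dependence can be sketched numerically. The code below does not reproduce Owen's closed-form updating equations. As a stand-in, it computes the posterior mean and variance after each item by summation over a grid and then, as in a sequential normal approximation, replaces the posterior with a normal distribution having that mean and variance. The grid, the item parameters, and the function names are assumptions.

```python
import math

def p_normal_ogive(theta, a, b, c):
    """Three-parameter normal ogive probability of a correct answer."""
    phi = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi

GRID = [-6.0 + 0.02 * k for k in range(601)]  # attribute values from -6 to +6

def normal_density(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def update(mean, sd, item, correct):
    """One step: normal prior times one item's likelihood, summarized as a normal."""
    a, b, c = item
    weights = []
    for theta in GRID:
        p = p_normal_ogive(theta, a, b, c)
        weights.append(normal_density(theta, mean, sd) * (p if correct else 1.0 - p))
    total = sum(weights)
    new_mean = sum(w * t for w, t in zip(weights, GRID)) / total
    new_var = sum(w * (t - new_mean) ** 2 for w, t in zip(weights, GRID)) / total
    return new_mean, math.sqrt(new_var)

def sequential_estimate(items, responses, prior_mean=0.0, prior_sd=1.0):
    """Update the estimate one item at a time, in the order given."""
    mean, sd = prior_mean, prior_sd
    for item, correct in zip(items, responses):
        mean, sd = update(mean, sd, item, correct)
    return mean, sd

# Hypothetical items: (discrimination, difficulty, guessing).
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (1.5, 0.5, 0.2), (0.8, 1.5, 0.2)]
responses = [1, 1, 0, 0]
forward = sequential_estimate(items, responses)
backward = sequential_estimate(items[::-1], responses[::-1])
print(forward, backward)
```

Because the posterior is forced back to a normal shape after every item, processing the same responses in reverse order need not give exactly the same final estimate, although the two values should be close.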

Another  noteworthy  difference  between  these  two  methods  concerns 
their  statistical  properties.  Owen's  Bayesian  estimator  behaves  like 
a  regression  estimate:  Extreme  values  are  biased  toward  the  initial 
(prior)  estimate,  which  is  the  mean  of  the  normal  Bayesian  prior  dis¬ 
tribution  assumed  for  the  location  parameter.  This  bias  may  not  be 
linear,  as  McBride  (1975)  demonstrated,  and  may  be  undesirable  for  ap¬ 
plications  (such  as  criterion-referenced  testing)  in  which  the  numeri¬ 
cal  accuracy  of  the  estimator  is  of  some  consequence.  Urry  (1977a) 
pointed  out  that  the  bias  in  the  Bayesian  estimates  is  readily  cor¬ 
rectable  using  an  ancillary  method,  but  no  data  are  available  concern¬ 
ing  the  efficacy  of  Urry's  proposed  correction.  The  maximum  likelihood 
estimator  does  not  seem  to  be  subject  to  the  systematic  bias  of  Owen's 
Bayesian scoring method, but requires appreciably more computer
processing time and sometimes fails to converge to a satisfactory estimate
(McBride, 1975).

Sympson  (1976)  reported  developing  two  alternative  methods  for 
the  examinee  parameter  estimation  problem.  One  method  is  a  Bayesian 
method  that  considers  the  examinee's  entire  vector  of  item  scores  at 
once and thus avoids the order-dependence of Owen's sequential scoring
method.  It  is  also  more  general  than  Owen's  method  in  that  it  is  not 
restricted  to  assuming  a  normal  prior  distribution  on  the  latent  attri¬ 
bute.  Instead,  the  user  is  free  to  specify  any  form  for  the  Bayesian 
prior  distribution. 
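
A Bayesian estimate that uses the whole response vector at once, with a prior of any form, can be sketched as a posterior mean computed over a grid. This is a generic illustration of the idea and should not be read as Sympson's algorithm, whose details are not given here. The three-parameter logistic form, the constant D = 1.7, the grid limits, and both example priors are assumptions.

```python
import math

def p3(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def bayes_estimate(items, responses, prior, lo=-6.0, hi=6.0, n=601):
    """Posterior mean of the attribute given all item scores at once."""
    step = (hi - lo) / (n - 1)
    numerator = 0.0
    denominator = 0.0
    for k in range(n):
        theta = lo + k * step
        weight = prior(theta)                  # prior need not be normalized
        for (a, b, c), u in zip(items, responses):
            p = p3(theta, a, b, c)
            weight *= p if u else 1.0 - p      # likelihood of the whole vector
        numerator += weight * theta
        denominator += weight
    return numerator / denominator

def normal_prior(theta):
    return math.exp(-0.5 * theta * theta)

def flat_prior(theta):
    return 1.0 if -3.0 <= theta <= 3.0 else 0.0

# Hypothetical items: (discrimination, difficulty, guessing).
items = [(1.2, -1.0, 0.2), (1.0, 0.0, 0.2), (1.5, 0.5, 0.2), (0.8, 1.5, 0.2)]
responses = [1, 1, 1, 0]

normal_est = bayes_estimate(items, responses, normal_prior)
flat_est = bayes_estimate(items, responses, flat_prior)
reordered = bayes_estimate(items[::-1], responses[::-1], normal_prior)
print(normal_est, flat_est, reordered)
```

Reordering the items leaves the estimate unchanged apart from rounding error, and swapping the normal prior for a flat one shows how the choice of prior affects the amount of regression toward the prior mean.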

Nonstatistical Scoring Procedures. The scoring methods discussed
above yield statistical estimates of an examinee's location on a scale.
Several less-sophisticated scoring methods are available that yield
numerical indices useful for ordering examinees. Such methods have the
advantage of computational simplicity, but lack the properties of statistical
estimators.  Indices  have  been  proposed  for  several  different  adaptive 
testing  strategies.  Some  of  these  indices  are  specific  to  the  strate¬ 
gies  that  gave  rise  to  them,  while  others  are  generalizable  to  two  or 
more  adaptive  strategies.  Weiss  and  Betz  (1973)  and  Weiss  (1974)  have 
discussed  nonstatistical  scoring  methods  in  detail.  Vale  and  Weiss 
(1975)  evaluated  alternative  methods  against  one  another  and  found  one 
originally proposed by Lord to be generally superior to the others.
This index, called the "average difficulty score," is computed by
summing the item difficulty values of all test items answered by an examinee
and  computing  the  average.  The  item  difficulty  values  involved  are  the 
difficulty  parameters  of  the  item  characteristic  curves,  not  the  tradi¬ 
tional  p-value  difficulty  indices. 

The  average  difficulty  score  is  appropriate  for  adaptive  tests  in 
which  all  examinees  begin  testing  at  the  same  difficulty  level.  Although 
it  may  be  used  in  conjunction  with  tests  having  variable  entry  levels, 
its  properties  have  not  been  systematically  investigated  in  such  a  con¬ 
text.  The  weight  given  to  the  difficulty  of  the  first  item  in  a  vari¬ 
able  entry  level  test  may  have  the  effect  of  biasing  test  scores  in  the 
direction  of  the  pretest  estimate  of  the  examinee’s  ability. 

An  alternative  to  the  average  difficulty  score  is  to  calculate 
only  the  average  difficulty  of  the  items  answered  correctly;  however, 
test  scores  calculated  in  this  fashion  correlate  almost  perfectly  with 
the  average  difficulty  of  the  items  administered  (Vale  &  Weiss,  1975). 
Other  nonstatistical  scoring  procedures  evaluated  to  date  have  been 
generally  inferior  to  these  two  methods,  even  for  scoring  appropriate 
types  of  adaptive  tests;  therefore,  they  will  not  be  discussed  here. 
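
The two indices discussed above are simple enough to state in a few lines. In this sketch the difficulty values stand for item characteristic curve difficulty parameters, and the administered sequence is a made-up example of a test in which a right answer leads to a harder item and a wrong answer to an easier one.

```python
def average_difficulty(difficulties):
    """Mean difficulty parameter of all items the examinee answered."""
    return sum(difficulties) / len(difficulties)

def average_difficulty_correct(difficulties, responses):
    """Mean difficulty parameter of the items answered correctly."""
    correct = [b for b, u in zip(difficulties, responses) if u]
    if not correct:
        raise ValueError("no items were answered correctly")
    return sum(correct) / len(correct)

# Hypothetical six-item adaptive test: difficulty of each item administered,
# and whether it was answered correctly (1) or incorrectly (0).
administered = [0.0, 0.5, 1.0, 0.5, 1.0, 0.5]
responses = [1, 1, 0, 1, 0, 1]

all_items_score = average_difficulty(administered)
correct_items_score = average_difficulty_correct(administered, responses)
print(all_items_score, correct_items_score)
```

Neither index uses the discrimination or guessing parameters, which is the source of both its computational simplicity and its lack of the properties of a statistical estimator.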

The  Testing  Medium 

The  adaptive  test  merits  consideration  as  a  possible  replacement 
for  conventional  standardized  group  tests.  Therefore,  the  test  admin¬ 
istration  medium  must  be  amenable  to  testing  relatively  large  numbers 
of  examinees.  There  is  a  need  to  identify  media  that  can  meet  this 
requirement  and  to  evaluate  such  media  both  absolutely  and  in  a  com¬ 
parative  sense. 


The media available for administering adaptive tests fall into two
categories: specially designed paper-and-pencil tests and automated
testing terminals. A paper-and-pencil adaptive test superficially
resembles a conventional test, but requires the examinee to comprehend
and  follow  relatively  complex  instructions  for  the  sequential  choice 
of  test  items  and  for  marking  item  responses.  The  added  complexity  of 
the  examinee's  task  in  taking  a  paper-and-pencil  adaptive  test  may  be 
excessive,  particularly  for  lower  ability  persons,  with  the  result 
that  the  dimension  to  be  measured  is  confounded  with  the  examinee's 
ability  to  follow  the  instructions.  If  such  a  confounding  occurs  to 
any  substantial  degree,  the  test  may  be  an  invalid  measure  of  the  in¬ 
tended  trait  dimension.  An  obvious  research  issue  is  to  inventory  the 
available  methods  for  administering  adaptive  tests  in  the  paper-and- 
pencil  medium  and  to  evaluate  the  extent  to  which  examinee  task  com¬ 
plexity  is  excessive. 

Automated  administration  of  an  adaptive  test  relieves  the  examinee 
of  the  burden  of  complying  with  the  complex  instructions;  instead,  the 
testing device assumes this burden. This benefit is not achieved
without cost, however. Typically, automated tests have been administered
at  interactive  computer  terminals,  a  medium  currently  more  expensive 
than paper-and-pencil administration. For adaptive administration of
tests composed of items like those in paper-and-pencil group tests
(typically, multiple-choice items), a device much less sophisticated
than a CRT computer terminal will in principle suffice. Test
administration using such a device should be considerably less expensive than
the  use  of  a  computer.  Clearly,  the  identification  and  design  of  al¬ 
ternative  devices  for  automated  testing  is  an  important  issue  for  re¬ 
search  and  development. 


State  of  the  Art 

Paper-and-Pencil  Adaptive  Tests.  Bayroff,  Thomas,  and  Anderson 
(1960)  designed  experimental  paper-and-pencil  branching  tests  based  on 
Krathwohl  and  Huyser's  (1956)  scheme  for  a  "sequential  item  test,"  a 
pyramidal adaptive strategy (Weiss, 1974). On subsequent
administration of branching tests of word knowledge and arithmetic reasoning,
respectively,  Seeley,  Morton,  and  Anderson  (1962)  found  that  5%  and  22% 
of  the  examinees  made  critical  errors  in  following  the  item  branching 
instructions.  Such  errors  made  those  examinees'  answer  sheets  unscor- 
able  under  the  scoring  method  used;  the  tendency  to  such  errors  was 
related  to  general  ability. 

Lord (1971a) devised the flexilevel testing method, an adaptive
strategy  specifically  intended  for  paper-and-pencil  testing.  Olivier 
(1974)  administered  flexilevel  tests  of  word  knowledge  to  635  high 
school  students  and  found  that  17%  of  his  examinees’  tests  were  unscor- 
able  because  they  had  made  critical  errors  in  branching. 
The Seeley et al. (1962) and Olivier (1974) experiences have created
an air of pessimism about the feasibility of using the paper-and-pencil
medium for adaptive testing. This pessimism is based on two
facts: (a) a substantial proportion of examinees tested has been
unable to follow the item-to-item branching instructions; (b) under the
scoring methods used, certain branching errors made the tests
unscorable. If the concept of paper-and-pencil adaptive tests is to be
salvaged, both problems must be solved. That is, the complexity of the
examinee's  task  must  be  reduced,  and  scoring  procedures  must  be  devised 
that  can  accommodate  item  branching  errors. 

The  statistical  scoring  methods  based  on  item  characteristic  curve 
theory,  discussed  in  another  section,  satisfy  the  latter  requirement. 
They  provide  a  means  of  calculating  a  score,  using  a  common  metric,  for 
examinees  who  answered  different  sets  of  test  items.  These  scoring 
methods  are  applicable  even  to  examinees  who  erred  in  item  branching, 
provided  that  it  is  known  which  items  were  answered  and  whether  the 
answers  were  right  or  wrong. 


Since  the  use  of  item  characteristic  curve  theory  in  effect  solves 
the  scorability  problem,  all  that  remains  to  make  paper-and-pencil  adap¬ 
tive  testing  feasible  is  to  minimize  the  problem  of  the  complexity  of 
the  branching  task.  This  problem  has  not  been  solved  to  date,  although 
tentative approaches to its solution have been taken (e.g., McBride,
1973).

Perhaps  the  simplest  solution  proposed  is  the  "self-tailored  test" 
suggested  by  Wright  and  Douglas  (1975)  for  use  with  test  items  that 
satisfy  the  Rasch  simple  logistic  response  model.  Test  items  are  printed 
in  the  booklet  in  ascending  order  of  difficulty.  The  examinee  is  in¬ 
structed  to  start  answering  test  items  at  whatever  difficulty  level  he 
or  she  chooses  and  to  stop  where  he  or  she  chooses  (or  perhaps  to  answer 
a  fixed  number  of  items).  The  test  score  (a  Rasch  ability  estimate, 
which  can  be  determined  by  referring  to  a  preprinted  table)  would  be  a 
function  of  the  difficulty  levels  of  the  easiest  item  answered  and  the 
most  difficult  item  answered,  and  the  number  of  items  answered  correctly 
in  between. 
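
The preprinted score table mentioned above could be generated along the following lines. This is a sketch under stated assumptions, not the Wright and Douglas procedure: it supposes that the Rasch difficulty of every booklet item is known, that the examinee answers every item between the easiest and the most difficult one attempted, and that the estimate is found by bisection on the Rasch likelihood equation. The booklet and all names are hypothetical.

```python
import math

def expected_score(theta, difficulties):
    """Expected number correct under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def rasch_estimate(number_correct, difficulties, lo=-10.0, hi=10.0):
    """Solve expected_score(theta) = number_correct by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < number_correct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def score_table(booklet, first, last):
    """Estimates for each possible number correct on items first..last."""
    segment = booklet[first:last + 1]
    return {r: rasch_estimate(r, segment) for r in range(1, len(segment))}

# Hypothetical booklet: 25 items in ascending difficulty from -3.0 to +3.0.
booklet = [-3.0 + 0.25 * k for k in range(25)]

low_table = score_table(booklet, 8, 15)    # items with difficulty -1.0 to 0.75
high_table = score_table(booklet, 12, 19)  # items with difficulty 0.0 to 1.75
for r in sorted(low_table):
    print(r, round(low_table[r], 2), round(high_table[r], 2))
```

Zero and perfect scores are omitted from the table because they have no finite Rasch estimate. The same number correct maps to a higher estimate when the examinee chooses a more difficult run of items.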

The  Wright  and  Douglas  notion  is  appealing  in  its  simplicity,  but 
it  has  drawbacks.  First,  its  psychometric  merits  depend  heavily  on  the 
ability  and  willingness  of  the  examinee  to  choose  test  items  that  are 
most  informative  for  ability  level — neither  too  difficult  nor  too  easy. 
Second,  its  linear  branching  rules  and  ability-estimation  procedures 
are  not  strictly  appropriate  where  guessing  is  a  factor  and  where  there 
is  appreciable  variability  in  the  discriminating  powers  of  the  test 
items. Nonetheless, this "self-tailored" testing scheme is worthy of
some  exploratory  research  in  settings  where  it  is  desirable  to  reduce 
substantially  the  number  of  items  each  examinee  must  respond  to. 

Where guessing is a factor and items vary appreciably in
discriminating power, the optimal choice of items in an adaptive test is a
function of those variables as well as of item difficulty. This suggests
that a somewhat more sophisticated rationale is required for adaptive
item  branching  than  the  simple  linear  progression  implicit  in  the 
Wright  and  Douglas  proposal.  Implementing  a  true  item  branching  pro¬ 
cedure  in  a  feasible  paper-and-pencil  version,  without  overbearing 
complexity,  may  call  for  new  approaches.  The  necessary  approach  is 
to  minimize  the  opportunity  for  error  by  making  the  branching  instruc¬ 
tions  as  simple  as  possible  and  as  few  as  possible. 

Simplicity  may  be  achieved  by  using  latent  ink  technology  in  de¬ 
signing  and  printing  answer  sheets,  thereby  making  the  branching  in¬ 
struction  unambiguous  and  contingent  only  on  what  answer  the  examinee 
gives  to  the  item  he  or  she  is  currently  working  on.  The  frequency  of 
item  branching  can  be  reduced  by  using  a  modified  two-stage  adaptive 
strategy;  the  first  stage  might  be  a  short  branching  test  of  several 
items,  while  the  second  stage  might  be  a  multilevel  test.  The  func¬ 
tion  of  the  first  stage  test  would  be  to  route  the  examinee  to  an  ap¬ 
propriate  level  in  the  second  stage.  Each  level  would  have  the  format 
of  a  short  conventional  test;  thus,  no  branching  instructions  need  be 
followed  during  the  second  stage.  This  notion  was  developed  further 
in  a  separate  paper  (McBride,  1978). 

Automated  Adaptive  Testing.  Most  research  on  adaptive  testing 
has  focused  on  computers  as  control  devices  and  on  computer  terminals 
as the medium for test administration. Although the computer is a
convenient and apt tool for automating testing, computers are
sufficient but not necessary for adaptive testing. Any device
capable  of  storing  and  displaying  test  items,  recording  and  scoring 
responses,  and  branching  sequentially  from  item  to  item  can  in  princi¬ 
ple  suffice  as  the  testing  medium.  The  computational  power  of  a  com¬ 
puter  may  be  highly  desirable  for  implementing  some  adaptive  testing 
strategies,  but  it  is  far  from  necessary  for  all.  Further,  tests  based 
on  dichotomously  scored  multiple-choice  test  items  make  such  minuscule 
demands  on  the  capability  of  a  modern  computer  that  use  of  a  computer 
solely  for  administration  of  such  tests  seems  wasteful.  Simpler  and 
less costly devices can do the job, and such devices should be developed.

The  first  concrete  effort  to  develop  a  simple  device  for  automated 
adaptive  testing  seems  to  have  been  one  made  at  the  Air  Force  Human 
Resources Laboratory, Technical Training Division (AFHRL/TT).
Personnel there have developed a prototype programmable microprocessor
terminal for administering an adaptive test (Waters, personal communication).
The  terminal  itself  resembles  a  hand-held  desk  calculator,  with  an  array 
of  numbered  keys  used  to  respond  to  test  items.  Its  display  device  is 
a small array of several light-emitting diodes (LEDs). The unit is
preprogrammed  to  direct  an  examinee  to  answer  a  response-contingent 
sequence  of  test  questions  that  are  printed  in  a  separate  test  booklet. 
After  recording  and  scoring  the  examinee’s  response  to  the  current 
test  item,  the  microprocessor  unit  computes  the  location  of  the  next 
item;  the  LED  displays  that  location  as  an  item  number;  the  examinee 
then  turns  to  that  item  in  the  test  booklet  and  responds  by  keying  in 
an answer on the keyboard. At test termination, the examinee's
protocol of identification data, item responses, and test score can be
"dumped" to a special-purpose computer before the next examinee is
tested.  Development  of  the  AFHRL  prototype  is  being  undertaken  by  an 
independent  contractor. 
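
The control cycle just described (display an item number, record and score the keyed response, compute the next item) can be sketched as a short loop. The sketch is not based on the prototype's actual program. It assumes a fixed-length test driven by a preprogrammed branching table, and the table, answer key, and callback names are all hypothetical.

```python
def run_adaptive_test(branch_table, answer_key, first_item,
                      read_key, show_item_number, test_length):
    """Drive one examinee through a response-contingent item sequence."""
    protocol = []
    item = first_item
    for position in range(test_length):
        show_item_number(item)           # display tells the examinee which item to find
        response = read_key()            # examinee keys in an answer
        correct = response == answer_key[item]
        protocol.append((item, response, correct))
        if position < test_length - 1:   # branch only while items remain
            wrong_next, right_next = branch_table[item]
            item = right_next if correct else wrong_next
    return protocol

# Hypothetical three-stage pyramid: item -> (next if wrong, next if right).
branch_table = {1: (2, 3), 2: (4, 5), 3: (5, 6)}
answer_key = {1: 2, 2: 1, 3: 4, 4: 3, 5: 1, 6: 2}

# Simulated keyboard and display for demonstration purposes.
keys = iter([2, 4, 1])
shown = []
protocol = run_adaptive_test(branch_table, answer_key, 1,
                             lambda: next(keys), shown.append, 3)
number_correct = sum(1 for _, _, correct in protocol if correct)
print(shown, protocol, number_correct)
```

The returned protocol holds the item responses that would be transferred to another machine at test termination. Nothing in the loop requires more than table lookup and comparison, which is consistent with the argument that a full computer is not needed for this kind of test.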

A  direct  extension  of  the  microprocessor  concept  is  contemplated 
by AFHRL/TT. This would involve using the programmable microprocessor
both  for  item  selection  and  for  controlling  the  display  on  a  peripheral 
device  of  test  items  stored  in  microform:  film  slides,  microfilm,  or 
microfiche.  The  contemplated  device  would  emulate  the  function  of  a 
full-scale  computer  terminal,  but  with  limited  interactive  capability. 

The  significance  of  this  step  is  that  the  examinee's  role  would  be 
limited  to  answering  the  sequence  of  displayed  test  items;  the  examinee 
would  not  have  to  participate  in  item  selection  or  in  locating  selected 
items.

In  considering  the  state  of  the  art  with  respect  to  automated 
testing  terminals,  it  is  useful  conceptually  to  consider  the  separate 
components  required  of  a  test  delivery  device.  These  include  the 
following:

•  Stimulus/display  device 

•  Response  device 

•  Item  storage  medium 

•  Internal  processing 

•  Response  processing  capability 

•  Item  selection  capability 

•  Test  scoring  capability 

•  Data  recording  capability. 

Display  devices  proposed  or  in  use  range  in  complexity  from  simple 
printed  matter,  to  microform  readers,  to  computer  graphics  terminals. 
Microform  readers  include  microfilm  reel  readers,  manual  microfiche 
readers,  and  automated  magazine  microfiche  and  ultrafiche  readers.  These 
microform  devices  are  capable  of  storing  and  displaying  any  test  material 
that  can  be  printed  and  photographed,  including  graphic  material.  The 
computer  terminals  amenable  to  automated  testing  include  teletypes, 
monochrome  CRT  terminals,  plasma  tube  (PLATO)  terminals,  and  color 
graphics  CRT  terminals.  Computer  terminals  typically  have  integral 
provisions  for  response  keyboards;  microform  display  units  do  not.  All 
devices  listed  above  are  commercially  available  off  the  shelf;  special 
provisions  may  be  required  to  integrate  each  into  a  testing  system  and 
to  interface  each  to  a  test  control  device. 

With  CRT  or  similar  computer  terminals,  test  item  storage  must  be 
in  computer  code,  either  core-resident  or  mass  storage  resident  and 
rapidly  accessible.  The  volume  of  displayable  material  needed  to  sup¬ 
port  a  full  battery  of  adaptive  tests  may  require  hundreds  of  thousands 
of  characters  of  computer  storage. 

Microform storage of test items is more efficient but less flexible
than computer storage. Items may be photographed and stored on
microfilm rolls, photographic slide magazines, microfiche, or
ultrafiche. Slide magazines are bulky and cumbersome and worth considering
only  as  a  prototype.  Microfilm  rolls  are  a  highly  efficient  storage 
medium,  but  the  machinery  needed  to  implement  adaptive  testing  with 
items  stored  on  microfilm  is  expensive  and  inappropriate.  Microfiche 
and  ultrafiche  seem  to  offer  an  acceptable  compromise.  A  single  4-by- 
6-inch  microfiche  can  contain  several  hundred  display  images;  an  ultra¬ 
fiche  of  similar  dimensions  can  hold  about  2,000  images.  Thus,  test 
items  for  a  sizable  battery  of  adaptive  tests  could  be  stored  on  about 
ten  microfiche  or  on  a  single  ultrafiche.  All  that  is  required  for  a 
test  item  display  device  is  the  ability  to  automate  the  microfiche/ 
ultrafiche  reader. 

Automated  microfiche  readers  are  already  commercially  available 
and  can  be  modified  readily  to  serve  as  testing  terminals  by  interfac¬ 
ing  them  to  appropriate  control  devices. 

The  internal  processing  requirements  of  automated  adaptive  test¬ 
ing  may  be  accomplished  by  a  central  computer,  minicomputer,  or  micro¬ 
computer,  entirely  within  today's  state  of  the  art.  System  design 
stands  between  current  development  and  implementation  of  a  computerized 
system  for  adaptive  testing. 

Some  efficiency  or  cost  effectiveness  may  be  gained  by  the  use  of 
special-purpose  microprocessors  to  control  the  test  itself  and  the  test¬ 
ing  equipment.  Again,  such  devices  are  well  within  the  current  state 
of  the  art  in  electronics.  The  equipment  needs  to  be  designed  and  in¬ 
tegrated  into  a  system  for  adaptive  testing. 


Discussion 


Adaptive  testing  involves  selective  administration  of  a  small 
subset  of  a  larger  pool  of  items  that  measure  the  trait  of  interest. 

The  size  of  this  item  pool,  along  with  the  psychometric  characteris¬ 
tics  of  the  constituent  items,  places  limits  on  the  measurement  proper¬ 
ties  of  the  adaptive  test.  Obviously,  the  item  pool  should  be  large 
enough  and  constituted  so  as  to  permit  the  adaptive  tests  to  function 
effectively.  Early  theoretical  research  in  adaptive  testing  suggested 
that  item  pools  had  to  be  large,  ranging  from  one  or  two  to  several 
hundred  or  several  thousand  test  items.  More  recently,  computer  simu¬ 
lation  research  by  Jensema  (1977)  and  other  associates  of  Urry  has  shown 
that  adaptive  tests  can  function  very  well  at  test  lengths  of  5  to  30 
items  and  that  item  pools  containing  50  to  200  items  are  of  sufficient 
size,  provided  that  prescriptions  for  the  psychometric  characteristics 
of  the  test  item  are  met.  These  prescriptions  concern  the  magnitude 
of  the  items'  item  response  model  discrimination  parameters,  the  range 
and  distribution  of  the  item  difficulty  parameters,  and  the  suscepti¬ 
bility  of  the  items  to  random  guessing. 
Urry (1974) has listed such prescriptions for items calibrated
(with  the  three-parameter  ogive  model)  against  an  ability  scale  on 
which  the  examinee  population  is  distributed  normal  (0,1).  They  in¬ 
clude  item  discrimination  parameters  exceeding  .80,  item  guessing 
parameters  below  .30,  and  a  rectangular  distribution  of  item  difficul¬ 
ties  ranging  approximately  from  -2  to  +2  units  on  a  standard  deviation 
scale.  McBride  (1976b)  suggested  an  even  wider  range  of  item  difficulty 
and  found  that  item  pools  with  100  and  150  items  supported  satisfactory 
measurement  properties  in  their  adaptive  tests.  For  measurements  focus¬ 
ing  on  the  trait  scale  interval  between  -2  and  +2  standard  deviations 
about  the  population  mean,  a  100-item  pool  seems  sufficient  (e.g., 
Schmidt & Gugel, 1975; McBride, 1976b). For measurement over a wider
interval,  a  wider  span  of  item  difficulty  is  indicated,  along  with  a 
proportional  increase  in  item  pool  size;  see  Lord  (1977)  and  McBride 
(1976b)  for  examples. 
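
Screening candidate items against prescriptions of this kind is straightforward to automate. The sketch below applies the discrimination and guessing thresholds quoted above and then tallies the surviving items by difficulty interval so that departures from a rectangular distribution are visible. The record layout, the number of intervals, and the candidate items are hypothetical.

```python
def screen_pool(items, min_a=0.80, max_c=0.30):
    """Keep items whose discrimination exceeds min_a and guessing is below max_c."""
    return [item for item in items if item["a"] > min_a and item["c"] < max_c]

def difficulty_counts(items, lo=-2.0, hi=2.0, bins=4):
    """Count items per equal-width difficulty interval between lo and hi."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for item in items:
        if lo <= item["b"] <= hi:
            index = min(int((item["b"] - lo) / width), bins - 1)
            counts[index] += 1
    return counts

# Hypothetical calibrated items: a = discrimination, b = difficulty, c = guessing.
candidates = [
    {"id": 1, "a": 1.10, "b": -1.8, "c": 0.20},
    {"id": 2, "a": 0.60, "b": -0.5, "c": 0.15},  # discrimination too low
    {"id": 3, "a": 0.95, "b": 0.1, "c": 0.35},   # guessing too high
    {"id": 4, "a": 1.40, "b": 0.4, "c": 0.22},
    {"id": 5, "a": 0.85, "b": 1.9, "c": 0.10},
]

pool = screen_pool(candidates)
counts = difficulty_counts(pool)
print([item["id"] for item in pool], counts)
```

An interval with a zero or very low count marks a region of the attribute scale where the pool cannot support adaptive item selection, and where further item writing or screening would be needed.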

Because  of  the  requisite  size  of  item  pools  for  adaptive  testing 
and  the  prescriptions  concerning  the  needed  psychometric  characteristics 
of  the  test  items,  a  question  of  the  feasibility  of  assembling  adequate 
item  pools  arises.  Large  numbers  of  test  items  used  in  conventional 
tests  will  not  meet  the  discrimination  parameter  criterion  for  inclusion 
in  adaptive  test  item  pools.  Furthermore,  the  wide,  rectangular  dis¬ 
tribution  of  item  difficulty  specified  by  Urry's  prescription  may  be 
difficult  to  satisfy.  In  many  settings  it  may  not  be  feasible  to  con¬ 
struct  adaptive  test  item  pools  from  off-the-shelf  test  items.  However, 
where  large-scale  testing  programs  are  already  in  progress,  the  outlook 
is better. Urry (1974), for example, was able to assemble a 200-item
pool  for  adaptive  testing  of  verbal  ability  by  screening  about  700  items 
in  15  forms  of  a  U.S.  Civil  Service  Test.  Lord  (1977)  has  made  availa¬ 
ble  for  research  a  pool  of  690  verbal  items  from  obsolete  forms  of  sev¬ 
eral  tests  published  by  the  Educational  Testing  Service. 

In  military  testing,  current  and  obsolete  test  batteries  in  the 
aggregate  contain  hundreds  of  test  items  for  each  of  several  cognitive 
ability  variables  that  have  been  measured  by  military  tests  for  several 
years.  For  example,  test  variables  such  as  word  knowledge,  arithmetic 
reasoning,  and  general  information  have  been  included  in  Army  selection 
test batteries^1 through several generations of tests and multiple forms
within  each  generation.  Such  tests  can  be  expected  to  contain,  in  their 
various  alternate  forms,  sufficient  numbers  of  test  items  from  which  to 
select  the  items  to  constitute  item  pools  for  adaptive  testing. 

For  test  variables  not  having  a  large  bank  of  items  already  in 
existence,  a  major  item-writing/item-pool  development  program  will  be 
necessary. Even for variables already well represented in large numbers
of test items, other problems remain to be solved before the
adaptive testing item pools can be assembled. Reference here is to the
problem of item calibration: estimating the latent trait response model
parameters of each item's characteristic curve.

^1 Examples include the Armed Forces Qualification Test (AFQT), the Army
Classification Battery (ACB), and the current Armed Services Vocational
Aptitude Battery (ASVAB).


In  a  previous  section,  the  existence  of  computer  programs  for  esti¬ 
mating  item  parameters  was  mentioned.  The  basic  data  required  by  such 
programs  are  the  dichotomously  scored  responses  of  examinees  to  a  mod¬ 
erately  large  number  of  test  items.  Urry  (1977a)  and  Schmidt  and  Gugel 
(1975)  have  reported  research  results  that  suggest  that  the  number  of 
examinees  should  equal  or  exceed  2,000  in  order  to  achieve  accurate  es¬ 
timates  of  the  item  parameters  for  a  three-parameter  item  response  model. 
Presumably  somewhat  smaller  numbers  will  suffice  for  the  simpler  but 
less general one- and two-parameter response models. The important
point is that errors of parameter estimation will increase as either
or both of the two sample sizes (items and persons) decrease.
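
As a point of reference for the three-parameter response model discussed
above, the following minimal sketch evaluates an item characteristic
curve in its logistic form. The function name icc_3pl, the scaling
constant D = 1.7 (with which the logistic curve closely approximates the
normal ogive), and the example parameter values are illustrative
assumptions, not specifications taken from this report.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under a three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote ("guessing" parameter).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item with moderate discrimination and a 0.2 guessing floor.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(icc_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

Setting c = 0 gives the two-parameter form, and additionally holding a
constant across items gives the one-parameter form, which is why the
simpler models have fewer quantities to estimate per item.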

In  calibrating  the  test  items  of  large-scale  testing  programs, 
such  as  ACB  and  ASVAB,  access  to  adequately  large  examinee  samples 
should  not  be  a  problem,  since  hundreds  of  thousands  of  examinees  take 
each  form  of  a  battery  annually.  However,  the  item  sample  sizes  are  in 
many  cases  inadequate  by  Urry's  standards.  For  example,  the  longest 
subtest  in  the  current  ASVAB  is  only  30  items.  Most  ASVAB  subtests  are 
shorter.  If  accurate  item  calibration  is  not  possible  using  the  exist¬ 
ing  answer  sheets  from  such  subtests,  then  item  calibration  studies 
will  need  to  include  administration  of  longer  subtests  to  large  numbers 
of  examinees  in  a  testing  program  separate  from  current  operational 
testing.  On  the  other  hand,  if  a  means  can  be  found  that  will  permit 
accurate  item  calibration  based  on  item  responses  to  current  subtests, 
there  will  be  a  substantial  reduction  in  the  expense  and  effort  required 
to  assemble  adaptive  testing  item  pools. 

State of the Art. For estimating item parameters under a three-parameter
response model, two existing computer programs are appropriate:
OGIVEIA, described by Urry (1977a); and LOGIST, described by Lord
(1974b). Item calibration research based on OGIVEIA led Urry to prescribe
test lengths of 60 items and examinee samples of 2,000 as the
minimum  values  for  satisfactory  parameter  estimation.  Lord  (1974b) 
recommended  a  similar  examinee  sample  size,  but  made  no  mention  of  the 
requisite  test  length. 

Urry's  program  is  appropriate  for  calibrating  dichotomously  scored 
items  only;  no  provision  is  made  for  item  scores  other  than  right  or 
wrong;  further,  it  explicitly  assumes  a  normal  distribution  of  the 
ability  parameter.  LOGIST  contains  explicit  provision  for  differenti¬ 
ating  unanswered  items  from  those  answered  incorrectly.  It  treats  dif¬ 
ferentially  two  categories  of  unanswered  items:  items  reached  but 
omitted  and  items  not  reached.  Items  not  reached  are  ignored  during 
the  portion  of  the  item  calibration  process  in  which  an  examinee's 
ability  parameter  is  estimated.  Lord  (1974b)  has  suggested  that  this 
feature  of  LOGIST  may  be  useful  for  calibrating  sets  of  test  items  in 
which not all examinees answer the same items. Thus it may be possible
to use LOGIST to calibrate simultaneously items from two or more alternate
forms, where a different examinee sample responds to each form.
LOGIST  makes  no  assumptions  regarding  the  form  of  the  distribution  of 
ability. 
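
The idea of ignoring not-reached items when estimating an examinee's
ability can be sketched as follows. This is not LOGIST's algorithm:
LOGIST estimates item and ability parameters together, whereas this
sketch treats the item parameters as known and uses a simple grid search
for ability. The NOT_REACHED code, the function names, and the two small
"forms" are hypothetical, and the items are assumed to be already
expressed on a common scale.

```python
import math

NOT_REACHED = None  # hypothetical code for an item the examinee never reached

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector; not-reached items are skipped."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        if u is NOT_REACHED:
            continue  # contributes nothing to the ability estimate
        p = icc_3pl(theta, a, b, c)
        total += math.log(p if u == 1 else 1.0 - p)
    return total

def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=800):
    """Grid-search maximum likelihood estimate of ability on [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Two hypothetical alternate forms pooled into one item set: (a, b, c) per item.
form_a = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
form_b = [(1.2, -0.5, 0.2), (1.2, 0.5, 0.2)]
pooled_items = form_a + form_b

# Each examinee took only one form; the other form's items are "not reached".
examinee_a = [1, 1, 0, NOT_REACHED, NOT_REACHED]
examinee_b = [NOT_REACHED, NOT_REACHED, NOT_REACHED, 1, 0]

theta_a = estimate_theta(examinee_a, pooled_items)
theta_b = estimate_theta(examinee_b, pooled_items)
print(theta_a, theta_b)
```

Because skipped items drop out of the likelihood, the pooled response
matrix can be sparse: each examinee sample fills in only the columns
belonging to the form it actually took.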

Two  research  questions  need  resolution  before  adaptive  testing 
item  pools  can  be  constructed  from  existing  test  items.  First,  what 
are  the  effects  of  calibrating  test  items  from  the  answer  sheets  of 
rather  short  tests  (20  to  30  items)?  Second,  if  those  effects  are  not 
favorable,  is  it  feasible  to  calibrate  items  by  pooling  answer  sheets 
from  two  or  more  forms,  each  taken  by  different  examinees,  to  increase 
the  number  of  items  to  a  size  needed  for  satisfactory  calibration? 

These  questions  are  not  readily  amenable  to  answers  based  on  theoreti¬ 
cal  or  mathematical  analysis.  However,  they  may  be  answered  empiri¬ 
cally  by  means  of  simulated  calibration  of  artificial  item  response 
data  along  lines  used  by  Lord  (1975b)  or  by  Schmidt  and  Gugel  (1975). 
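
A simulation study of this kind starts by generating artificial item
responses from known parameters, so that estimates from a calibration
program can later be compared with the true values. The sketch below
covers only that generation step, for one short 30-item test and 2,000
simulated examinees. The parameter ranges, the normal ability
distribution, the random seed, and all names are illustrative
assumptions; they are not the designs used by Lord (1975b) or by
Schmidt and Gugel (1975).

```python
import math
import random

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """Return a persons-by-items matrix of 0/1 responses drawn from the model."""
    return [
        [1 if rng.random() < icc_3pl(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]

rng = random.Random(1977)   # fixed seed so the artificial data are reproducible
n_examinees = 2000          # examinee sample size discussed in the text
n_items = 30                # a "rather short" test

# True abilities and true item parameters (a, b, c), all assumed values.
thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
items = [(rng.uniform(0.8, 2.0), rng.uniform(-2.0, 2.0), 0.2)
         for _ in range(n_items)]

data = simulate_responses(thetas, items, rng)

# Simple sanity check: observed proportion correct for each item.
p_values = [sum(row[j] for row in data) / n_examinees for j in range(n_items)]
print([round(p, 2) for p in p_values])
```

Repeating the generation with different values of n_items, or with
several forms taken by separate simulated samples, yields the data sets
needed to study the two research questions empirically.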

A  related  issue  is  one  of  equating  the  scales  derived  from  inde¬ 
pendent  calibrations  of  test  items  measuring  a  common  variable  but  con¬ 
tained  in  different  tests.  This  is  the  same  problem  as  making  item 
parameter  estimates  that  result  from  calibration  of  different  tests  in 
different  examinee  samples  all  have  reference  to  the  same  ability  metric. 
Lord  (1975a)  has  suggested  a  number  of  equating  methods,  based  on  item 
characteristic  curve  theory,  that  are  applicable  to  this  problem.  Some 
of  those  equating  methods  have  distinct  advantages  over  traditional 
equating  methods. 
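
One way to see what placing parameter estimates on the same ability
metric involves: under a logistic or normal ogive response model, a
linear change of the ability scale leaves every response probability
unchanged, provided the item parameters are transformed to match. The
sketch below illustrates this and estimates the two linking constants
from the means and standard deviations of difficulty estimates. It
assumes, beyond what the text states, that a set of items appears in
both calibrations; it is a generic illustration and not necessarily one
of the methods proposed by Lord (1975a). All names and numbers are
hypothetical.

```python
import math
from statistics import mean, pstdev

def icc_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def linking_constants(b_target, b_source):
    """Slope A and intercept B mapping the source metric onto the target metric,
    computed from difficulty estimates of the same items in both calibrations."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B

def rescale_item(a, b, c, A, B):
    """Express one item's parameters on the target metric (theta* = A*theta + B)."""
    return a / A, A * b + B, c

# Hypothetical difficulty estimates for three common items in two calibrations.
b_source = [-1.0, 0.0, 1.0]
b_target = [-1.0, 0.5, 2.0]

A, B = linking_constants(b_target, b_source)

# An item calibrated only on the source metric, moved to the target metric.
a_new, b_new, c_new = rescale_item(1.2, 0.4, 0.2, A, B)

# The response probability is unchanged when ability is rescaled the same way.
theta = 0.7
p_before = icc_3pl(theta, 1.2, 0.4, 0.2)
p_after = icc_3pl(A * theta + B, a_new, b_new, c_new)
print(A, B, p_before, p_after)
```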


Advances  in  Measurement  Methodology 

Discussion 


Current methods of measuring psychological traits overwhelmingly
use  tests  composed  of  dichotomously  scored  items.  In  ability  measure¬ 
ment,  each  such  item  is  a  task,  chosen  from  the  domain  of  relevant 
tasks,  that  an  examinee  performs  successfully  or  unsuccessfully,  cor¬ 
rectly  or  incorrectly.  Performance  on  each  item  task  is  taken  as  an 
indication  of  the  examinee's  level  of  functioning  on  an  underlying 
ability  trait.  Thus,  the  trait  is  only  indirectly  measured,  using 
item  tasks  that  have  only  imperfect  fidelity  to  the  trait  of  interest. 

For example, multiple-choice vocabulary test items often are used to
measure  verbal  ability. 

Most  adaptive  testing  research  has  used  the  same  kinds  of  items. 
Adaptive  testing  using  traditional  item  types  represents  an  improvement 
in  the  efficiency  of  measurement  but  no  improvement  in  the  fidelity  of 
the  test  behavior  to  the  trait  of  interest. 

The  usual  media  of  group  test  administration,  paper-and-pencil 
booklets  and  answer  sheets,  necessitated  the  compromise  of  task  fidelity. 
Administration  of  tests  by  computer  terminals,  as  is  common  in  adaptive 
testing  research,  opens  up  the  possibility  of  introducing  whole  new 
modes of stimulus and response to the methodology of measuring psychological
abilities and perhaps of improving the fidelity between tests
and  abilities.  The  implications  of  computerized  test  administration 
for measurement are potentially vast, as is the number of research
issues.


The  basic  issue  is  this:  How  can  the  capability  of  the  computer 
be  exploited  to  yield  more  and  better  test  information  about  individual 
examinees?  This  subsumes  other  questions,  such  as:  Can  test  stimuli 
be  enriched,  and/or  response  modes  expanded,  to  achieve  improved  mea¬ 
sures  of  current  ability  variables?  Can  nontraditional  ability  variables 
be  identified  and  measured,  yielding  improvements  in  test  fidelity  and 
validity?  Can  advances  in  measurement  procedures  be  made  that  are  ac¬ 
companied  by  advances  in  practical  utility? 


State  of  the  Art 


A comprehensive review of the current status of research in these
issues is beyond the scope of this paper. Only a cursory overview will
be attempted.

For  measuring  traditional  ability  variables,  expanded  stimulus  and 
response  modes  are  made  possible  by  computer  administration.  On  the 
response  side,  several  different  approaches  are  possible.  One  is  to 
permit  on-line  polychotomous  scoring  rather  than  dichotomous  scoring  of 
traditional multiple-choice type items: Samejima (1969) and Bock (1972)
have  developed  psychometric  procedures  to  support  such  item  scoring 
methods.  A  more  sophisticated  approach  is  to  accept  natural  language, 
or  free  responses,  to  traditional  test  item  stimuli;  the  examinees 
could  type  their  answers  in  full  on  a  typewriter-like  keyboard  rather 
than  choose  multiple-choice  answers.  Natural-language  processing  com¬ 
puter  programs  would  be  used  to  check  free-form  responses  against  the 
nominal correct answers and thus to score item performance (see, for
example, Vale & Weiss, 1977).

Traditional  test  stimuli  are  static  and  usually  monochrome;  this 
is  necessitated  by  the  printed  medium  in  use.  Presenting  stimuli  at 
computer  terminals  makes  it  possible  to  introduce  multicolored  stimuli 
and  to  use  dynamic  test  items.  For  example,  the  examinee  may  be  per¬ 
mitted  to  "rotate"  in  space  a  three-dimensional  figure  presented  on  a 
CRT  screen  to  facilitate  visualization.  Cory  (1978)  has  experimented 
with  the  use  of  fragmentary  pictures  as  test  item  stimuli,  with  the 
examinee  able  to  increment  the  proportion  of  the  picture  presented. 

Computer  administration  has  been  suggested  as  a  means  of  measuring 
ability  variables  not  convenient  to  test  in  paper-and-pencil  format 
(Weiss,  1975).  This  will  permit  test  designers  and  users  to  transcend 
the  limits  of  traditional  ability  tests  that  measure  verbal  ability 
and  logical,  sequential  analytical  functions  associated  with  the  left 
hemisphere of the brain. Spatial perception, short-term memory, judgment,
integration of complex stimuli, cognitive information-processing,
and other complex abilities may be measurable by exploiting the power
and  flexibility  of  the  computer  terminal  as  a  testing  medium.  Cory 
(1978)  has  conducted  exploratory  research  investigating  computer  ad¬ 
ministration  of  some  novel  item  types.  Valentine  (1977)  has  discussed 
preliminary  efforts  directed  toward  computerized  assessment  of  certain 
psychomotor abilities. Rimland and his associates (Lewis, Rimland, &
Callaway,  1977)  have  used  a  computer  to  facilitate  measurements  of 
brain  activity  that  may  be  related  to  ability  variables.  Rose  (1978) 
is  investigating  measures  of  cognitive  information  processing  skills 
using  dynamic  computer-administered  problems  as  test  items.  All  of  the 
efforts  just  listed  have  shown  some  promise,  but  they  must  be  considered 
as  exploratory  efforts  that  may  or  may  not  lead  to  developments  that 
supplant  or  complement  traditional  methods  of  measuring  psychological 
abilities.